Abstract: The goal of a learner, in standard online learning, 
is to have the cumulative loss not much larger compared with 
the best-performing function from some fixed class. Numerous 
algorithms were shown to have this gap arbitrarily close to 
zero, compared with the best function that is chosen off-line. 
Nevertheless, many real-world applications, such as adaptive 
filtering, are non-stationary in nature, and the best prediction 
function may drift over time. We introduce two novel algorithms 
for online regression, designed to work well in non-stationary 
environments. Our first algorithm performs adaptive resets to 
forget the history, while the second is last-step min-max optimal 
in the context of drift. We analyze both algorithms in the 
worst-case regret framework and show that they maintain an average 
loss close to that of the best slowly changing sequence of linear 
functions, as long as the cumulative drift is sublinear. In addition, 
in the stationary case, when no drift occurs, our algorithms 
suffer logarithmic regret, as for previous algorithms. Our bounds 
improve over the existing ones, and simulations demonstrate the 
usefulness of these algorithms compared with other state-of-the- 
art approaches. 



I. Introduction 

We consider the classical problem of online learning for 
regression. On each iteration, an algorithm receives a new 
instance (for example, input from an array of antennas) and 
outputs a prediction of a real value (for example distance to the 
source). The correct value is then revealed, and the algorithm 
suffers a loss based on both its prediction and the correct 
output value. 

In the past half a century many algorithms were proposed 
(see e.g. a comprehensive book [9]) for this problem, 
some of which are able to achieve an average loss arbitrarily 
close to that of the best function in retrospect. Furthermore, 
such guarantees hold even if the input and output pairs are 
chosen in a fully adversarial manner with no distributional 
assumptions. Many of these algorithms exploit first-order 
information (e.g. gradients). 

Recently there has been increased interest in algorithms 
that exploit second-order information. Examples include 
the second-order perceptron algorithm [8], confidence-weighted 
learning [11], [13], and adaptive regularization of 
weights (AROW) [12], all designed for classification; and 
AdaGrad [14] and FTPRL [28] for general loss functions. 

Despite the extensive and impressive guarantees that can be 
made for algorithms in such settings, competing with the best 
fixed function is not always good enough. In many real-world 
applications, the true target function is not fixed, but is slowly 
changing over time. Consider a filter designed to cancel echoes 
in a hall. Over time, people enter and leave the hall, furniture 
is moved, microphones are replaced and so on. When 
this drift occurs, the predictor itself must also change in order 
to remain relevant. 

With such properties in mind, we develop new learning 
algorithms, based on second-order quantities, designed to work 
with target drift. The goal of an algorithm is to maintain an 
average loss close to that of the best slowly changing sequence 
of functions, rather than compete well with a single function. 
We focus on problems for which this sequence consists only 
of linear functions. Most previous algorithms (e.g. [1], [23], 
[26], [27]) designed for this problem are based on first-order 
information, such as gradient descent, with additional control 
on the norm of the weight-vector used for prediction [26] or 
the number of inputs used to define it [6]. 

In Sec. II we review three second-order learning algorithms: 
the recursive least squares (RLS) [22] algorithm, the Aggregating 
Algorithm for regression (AAR) [21], [36], which 
can be shown to be derived based on a last-step min-max 
approach [16], and the AROWR algorithm [35], which is a 
modification of the AROW algorithm [12] for regression. All 
three algorithms obtain logarithmic regret in the stationary 
setting, although they are derived using different approaches 
and are not equivalent in general. In Sec. III we formally present 
the non-stationary setting both in terms of algorithms and in 
terms of theoretical analysis. 

For the RLS algorithm, a variant called CR-RLS [10], 
[20], [31] for the non-stationary setting was described, yet 
not analyzed, before. In Sec. IV we present two algorithms 
for the non-stationary setting that build on the other two 
algorithms. Specifically, in Sec. IV-A we extend the AROWR 
algorithm to the non-stationary setting, yielding an algorithm 
called ARCOR, for adaptive regularization with covariance 
reset. Similar to CR-RLS, ARCOR performs a step called 
covariance reset, which resets the second-order information 
from time to time, yet the reset is triggered by the properties of 
this covariance-like matrix, and not by the number of 
examples observed, as in CR-RLS. 



In Sec. IV-B we derive a different algorithm based on the 
last-step min-max approach proposed by Forster [16] and later 
used [34] for online density estimation. On each iteration 
the algorithm makes the optimal min-max prediction with 
respect to the regret, assuming it is the last iteration. Yet, 
unlike previous work [16], it is optimal when a drift is 
allowed. As opposed to the derivation of the last-step min-max 
predictor for a fixed vector, the resulting optimization 
problem is not straightforward to solve. We develop a dynamic 
program (a recursion) to solve this problem, which 
allows us to compute the optimal last-step min-max predictor. 
We call this algorithm LASER, for last step adaptive regressor 
algorithm. We conclude the algorithmic part in Sec. IV-C, in 
which we compare all non-stationary algorithms head-to-head, 
highlighting their similarities and differences. Additionally, 
after describing the details of our algorithms, we provide in 
Sec. V a comprehensive review of previous work that puts our 
contribution in perspective. Both algorithms reduce to their 
stationary counterparts when no drift occurs. 

We then move to Sec. VI, which summarizes our next 
contribution, stating and proving regret bounds for both algorithms. 
We analyze both algorithms in the worst-case regret 
setting and show that as long as the cumulative drift is 
sublinear, the average loss of both algorithms will converge to 
the average loss of the best sequence of functions. Specifically, 
we show in Sec. VI-A that the cumulative loss of ARCOR after 
observing $T$ examples, denoted by $L_T(\text{ARCOR})$, is upper 
bounded by the cumulative loss of any sequence of weight-vectors 
$\{u_t\}$, denoted by $L_T(\{u_t\})$, plus an additional term 
$O\left(T^{1/2} (V(\{u_t\}))^{1/2} \log T\right)$, where $V(\{u_t\})$ measures the 
differences (or variance) between consecutive weight-vectors 
of the sequence $\{u_t\}$. Later, we show in Sec. VI-B a similar 
bound for the loss of LASER, denoted by $L_T(\text{LASER})$, for 
which the second term is $O\left(T^{2/3} (V(\{u_t\}))^{1/3}\right)$. We emphasize 
that in the two bounds the measure $V(\{u_t\})$ of differences 
between consecutive weight-vectors is not defined in the same 
way, and thus the bounds are not comparable in general. 



In Sec. VII we report results of simulations designed to 
highlight the properties of both algorithms, as well as the 
commonalities and differences between them. We conclude 
in Sec. VIII; most of the technical proofs appear in the 
appendix. 

The ARCOR algorithm was presented in a shorter publication [35], 
together with its analysis and some of its details. 
The LASER algorithm and its analysis were also presented in 
a shorter version [30]. The contribution of this submission is 
three-fold. First, we provide a head-to-head comparison of three 
second-order algorithms for the stationary case. Second, we 
fill the gap of second-order algorithms for the non-stationary 
case. Specifically, we add to CR-RLS (which extends RLS) 
by designing and analyzing second-order algorithms for the 
non-stationary case, building on both AROWR and AAR. Our 
algorithms are derived from different principles, 
which is reflected in our analysis. Finally, we provide 
empirical evidence showing that under different conditions 
a different algorithm performs best. 

Some notation we use throughout the paper: For a symmetric 
matrix $\Sigma$ we denote its $j$th eigenvalue by $\lambda_j(\Sigma)$. Similarly 
we denote its smallest eigenvalue by $\lambda_{\min}(\Sigma) = \min_j \lambda_j(\Sigma)$, 
and its largest eigenvalue by $\lambda_{\max}(\Sigma) = \max_j \lambda_j(\Sigma)$. For a 
vector $u \in \mathbb{R}^d$, we denote by $\|u\|$ the $\ell_2$-norm of the vector. 
Finally, for $y > 0$ we define $\mathrm{clip}(x, y) = \mathrm{sign}(x) \min\{|x|, y\}$. 



II. Stationary Online Learning 

We focus on the regression task evaluated with the squared 
loss. Our algorithms are designed for the online setting and 
work in iterations (or rounds). On each round an online 
algorithm receives an input-vector $x_t \in \mathbb{R}^d$ and predicts a 
real value $\hat{y}_t \in \mathbb{R}$. Then the algorithm receives a target label 
$y_t \in \mathbb{R}$ associated with $x_t$, uses it to update its prediction 
rule, and proceeds to the next round. 

At each iteration, the performance of the algorithm is evaluated 
using the squared loss, $\ell_t(\text{alg}) = \ell(y_t, \hat{y}_t) = (\hat{y}_t - y_t)^2$. 
The cumulative loss suffered by the algorithm over $T$ iterations 
is $L_T(\text{alg}) = \sum_{t=1}^{T} \ell_t(\text{alg})$. 

The goal of the algorithm is to have low cumulative loss 
compared to predictors from some class. A large body of 
work, which we adopt as well, is focused on linear prediction 
functions of the form $f(x) = x^\top u$, where $u \in \mathbb{R}^d$ is 
some weight-vector. We denote by $\ell_t(u) = (x_t^\top u - y_t)^2$ the 
instantaneous loss of a weight-vector $u$. The cumulative loss 
suffered by a fixed weight-vector $u$ is $L_T(u) = \sum_{t=1}^{T} \ell_t(u)$. 

The goal of the learning algorithm is to suffer low loss 
compared with the best linear function. Formally we define 
the regret of an algorithm to be 



$$R(T) = L_T(\text{alg}) - \inf_{u} L_T(u) . \tag{1}$$

The goal of an algorithm is to have $R(T) = o(T)$, so that 
the average loss converges to the average loss of the best 
linear function $u$. 

Numerous algorithms were developed for this problem; 
see the comprehensive review in the book of Cesa-Bianchi 
and Lugosi [9]. Among these, a few second-order online 
algorithms for regression were proposed in recent years, which 
we summarize in Table I. One approach for online learning is 
to reduce the problem into consecutive batch problems, and 
specifically to use all previous examples to generate a predictor, 
which is used to predict the current example. The recursive least 
squares (RLS) [22] approach, for example, sets a weight-vector 
to be the solution of the following optimization problem, 

$$w_t = \arg\min_{w} \sum_{i=1}^{t} r^{t-i} \left( y_i - w \cdot x_i \right)^2 .$$



Since the last problem grows with time, the well-known 
recursive least squares (RLS) [22] algorithm was developed to 
generate a solution recursively. The RLS algorithm maintains 
both a vector $w_t$ and a positive semi-definite (PSD) matrix $\Sigma_t$. 
On each iteration, after making a prediction $\hat{y}_t = x_t^\top w_{t-1}$, 
the algorithm receives the true label $y_t$ and updates, 

$$w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1})\, \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t} \tag{2}$$

$$\Sigma_t^{-1} = r\, \Sigma_{t-1}^{-1} + x_t x_t^\top . \tag{3}$$

The update of the prediction vector $w_t$ is additive, with the vector 
$\Sigma_{t-1} x_t$ scaled by the error $(y_t - x_t^\top w_{t-1})$ over the norm 
of the input measured using the norm defined by the matrix, 
$x_t^\top \Sigma_{t-1} x_t$. The algorithm is summarized in the right column 
of Table I. 
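
The recursion in (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `rls_step`, the synthetic data, and the choice r = 0.99 are assumptions made for the example. The matrix update uses the Sherman-Morrison identity so that no explicit inverse is needed, and the loop also tracks the non-recursive quantities as a consistency check.

```python
import numpy as np

def rls_step(w, Sigma, x, y, r=1.0):
    """One RLS round: predict with w_{t-1}, then apply updates (2) and (3)."""
    y_hat = x @ w
    Sx = Sigma @ x
    denom = r + x @ Sx
    w_new = w + (y - y_hat) * Sx / denom
    # Sherman-Morrison form of Sigma_t^{-1} = r * Sigma_{t-1}^{-1} + x x^T
    Sigma_new = (Sigma - np.outer(Sx, Sx) / denom) / r
    return y_hat, w_new, Sigma_new

# Illustrative data (an assumption): a fixed linear target, no noise.
rng = np.random.default_rng(0)
d, r = 3, 0.99
u = np.array([1.0, -2.0, 0.5])
w, Sigma = np.zeros(d), np.eye(d)
Sigma_inv, b_vec = np.eye(d), np.zeros(d)  # explicit bookkeeping, for checking only
for _ in range(200):
    x = rng.normal(size=d)
    y = x @ u
    _, w, Sigma = rls_step(w, Sigma, x, y, r)
    Sigma_inv = r * Sigma_inv + np.outer(x, x)   # update (3), written directly
    b_vec = r * b_vec + y * x                    # discounted sum of y_i x_i
```

With $w_0 = 0$ the recursive weights coincide with `Sigma @ b_vec`, the solution of the discounted least-squares problem with a decaying regularizer from $\Sigma_0 = I$.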

The Aggregating Algorithm for regression (AAR) [21], [36], 
summarized in the middle column of Table I, was introduced 
by Vovk and is similar to the RLS algorithm, except that it 
shrinks its predictions. The AAR algorithm was shown to be 
last-step min-max optimal by Forster [16]. Given a new input 
$x_T$ the algorithm predicts $\hat{y}_T$, which is the minimizer of the 
following problem, 

$$\arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u} \left( b \|u\|^2 + L_T(u) \right) \right] . \tag{4}$$

Forster also proposed a simpler analysis with the same regret 
bound. 

Finally, the AROWR algorithm [35] is a modification of the 
AROW algorithm [12] for regression. In a nutshell, the AROW 
algorithm maintains a Gaussian distribution parameterized by 
a mean $w_t \in \mathbb{R}^d$ and a full covariance matrix $\Sigma_t \in \mathbb{R}^{d \times d}$. 
Intuitively, the mean $w_t$ represents a current linear function, 
while the covariance matrix captures the uncertainty in the 
linear function $w_t$. Given a new example $(x_t, y_t)$ the algorithm 
uses its current mean to make a prediction $\hat{y}_t = x_t^\top w_{t-1}$. 
AROWR then sets the new distribution to be the solution of 
the following optimization problem, 

$$\arg\min_{w, \Sigma}\; D_{\mathrm{KL}}\left( \mathcal{N}(w, \Sigma) \,\|\, \mathcal{N}(w_{t-1}, \Sigma_{t-1}) \right) + \frac{1}{2r} \left( y_t - w^\top x_t \right)^2 + \frac{1}{2r}\, x_t^\top \Sigma x_t . \tag{5}$$



This optimization problem is similar to the one of AROW [12] 
for classification, except that we use the squared loss rather than 
the squared-hinge loss used in AROW. Intuitively, the optimization 
problem trades off between three requirements. The first term 
forces the parameters not to change much per example, as 
the entire learning history is encapsulated within them. The 
second term requires that the new vector $w_t$ should perform 
well on the current instance, and finally, the third term reflects 
the fact that the uncertainty about the parameters reduces as 
we observe the current example $x_t$. 

The weight vector solving this optimization problem (details 
given by Vaits and Crammer [35]) is given by 

$$w_t = w_{t-1} + \frac{(y_t - w_{t-1} \cdot x_t)\, \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t} , \tag{6}$$

and the optimal covariance matrix is 

$$\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac{1}{r}\, x_t x_t^\top . \tag{7}$$



The algorithm is summarized in the left column of Table I. 
Comparing AROWR to RLS we observe that while the update 
of the weights in (6) is equivalent to the update of RLS in (2), 
the update of the matrix in (3) for RLS is not equivalent to (7), as 
in the former case the matrix goes via a multiplicative update 
as well as an additive one, while in (7) the update is only additive. The 
two updates are equivalent only by setting $r = 1$. Moving to 
AAR, we note that the update rules for $w_t$ and $\Sigma_t$ in AROWR 
and AAR are the same if we define $\Sigma_t^{\text{AAR}} = \Sigma_t^{\text{AROWR}} / r$, but 
AROWR does not shrink its predictions as AAR does. Thus the three 
algorithms are not equivalent, although very similar. 
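
The stated correspondence between AROWR and AAR can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names `weight_step` and `cov_step`, the random data, and r = 0.5 are hypothetical, and AAR is started with b = r so that its initial matrix equals the AROWR one divided by r.

```python
import numpy as np

def weight_step(w, Sigma, x, y, a):
    """Shared weight update: w + (y - x.w) Sigma x / (a + x' Sigma x)."""
    Sx = Sigma @ x
    return w + (y - x @ w) * Sx / (a + x @ Sx)

def cov_step(Sigma, x, scale):
    """Return (Sigma^{-1} + scale * x x^T)^{-1}."""
    return np.linalg.inv(np.linalg.inv(Sigma) + scale * np.outer(x, x))

rng = np.random.default_rng(1)
d, r = 4, 0.5
w_arowr, S_arowr = np.zeros(d), np.eye(d)   # AROWR: Sigma_0 = I
w_aar, S_aar = np.zeros(d), np.eye(d) / r   # AAR with b = r, so Sigma_0 = I / r
for _ in range(50):
    x, y = rng.normal(size=d), rng.normal()
    # Weights are updated with the matrices from the previous round.
    w_arowr = weight_step(w_arowr, S_arowr, x, y, a=r)
    w_aar = weight_step(w_aar, S_aar, x, y, a=1.0)
    S_arowr = cov_step(S_arowr, x, 1.0 / r)  # update (7)
    S_aar = cov_step(S_aar, x, 1.0)          # AAR matrix update
```

Under this coupling the two weight vectors stay identical; the algorithms still differ in their predictions, since AAR divides $x_t^\top w_{t-1}$ by $1 + x_t^\top \Sigma_{t-1} x_t$.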



III. Non-Stationary Online Learning 

All previous algorithms assume, both by design and by analysis, 
that the data is stationary. The analysis of all algorithms 
compares their performance to that of a single fixed weight 
vector $u$, and all suffer regret that is logarithmic in $T$. 

We use an extended notion of evaluation, comparing our 
algorithms to a sequence of functions. We define the loss 
suffered by such a sequence to be 

$$L_T(u_1, \ldots, u_T) = L_T(\{u_t\}) = \sum_{t=1}^{T} \ell_t(u_t) ,$$

and the regret is then defined to be 

$$R(T) = L_T(\text{alg}) - \inf_{u_1, \ldots, u_T} L_T(\{u_t\}) . \tag{8}$$

We focus on algorithms that are able to compete against 
sequences of weight-vectors, $(u_1, \ldots, u_T) \in \mathbb{R}^d \times \cdots \times \mathbb{R}^d$, 
where $u_t$ is used to make a prediction for the $t$th example 
$(x_t, y_t)$. 

Clearly, with no restriction over the set $\{u_t\}$ the right term 
of the regret can easily be zero by setting $u_t = x_t (y_t / \|x_t\|^2)$, 
which implies $\ell_t(u_t) = 0$ for all $t$. Thus, in the analysis 
below we will make use of the total drift of the weight-vectors, 
defined to be 

$$V^{(P)} = V_T^{(P)}(\{u_t\}) = \sum_{t=1}^{T-1} \|u_t - u_{t+1}\|^P ,$$

where $P \in \{1, 2\}$. 
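
To make the role of the total drift concrete, the following sketch computes it for a few comparison sequences. The function name `total_drift` and the random data are assumptions for illustration. It also builds the unrestricted sequence $u_t = x_t (y_t / \|x_t\|^2)$ from the text, which attains zero loss but at the price of a large drift.

```python
import numpy as np

def total_drift(U, P=1):
    """Sum over t of ||u_t - u_{t+1}||^P, where the rows of U are u_1..u_T."""
    step_norms = np.linalg.norm(np.diff(U, axis=0), axis=1)
    return float(np.sum(step_norms ** P))

rng = np.random.default_rng(2)
T, d = 100, 3
X = rng.normal(size=(T, d))
y = rng.normal(size=T)

# Unrestricted comparison sequence: each u_t fits its own example exactly.
U_interp = X * (y / np.sum(X ** 2, axis=1))[:, None]
loss_interp = float(np.sum((np.sum(U_interp * X, axis=1) - y) ** 2))

# A fixed comparison vector has zero drift.
U_fixed = np.tile(np.ones(d), (T, 1))

# Tiny hand-checkable case: one jump of Euclidean length 5.
U_small = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
```

Here `total_drift(U_small, 1)` is 5 and `total_drift(U_small, 2)` is 25, since the only nonzero step is the vector (3, 4).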

For all three algorithms, as was also observed previously 
in the context of CW [13], AROW [12], AdaGrad [14] and 
FTPRL [28], the matrix $\Sigma$ can be interpreted as an adaptive 
learning rate. As these algorithms process more examples, 
that is, for larger values of $t$, the eigenvalues of the matrix 
$\Sigma_t^{-1}$ increase and the eigenvalues of the matrix $\Sigma_t$ decrease, 
so the rate of updates gets smaller, since the 
additive term $\Sigma_{t-1} x_t$ gets smaller. As a consequence 
the algorithms will gradually stop updating using current 
instances which lie in the subspace of examples that were 
previously observed numerous times. This property leads to a 
very fast convergence in the stationary case. However, when 
we allow these algorithms to be compared with a sequence of 
weight-vectors, each applied to a different input example, or 
equivalently, when there is a drift or shift of a good prediction vector, 
these algorithms will perform poorly, as they will converge and 
not be able to adapt to the non-stationary nature of the data. 

This phenomenon motivated the proposal of the CR-RLS 
algorithm [10], [20], [31], which resets the covariance 
matrix every fixed number of input examples, preventing the 
algorithm from converging or getting stuck. The pseudo-code of 
the CR-RLS algorithm is given in the right column of Table II. The 
only difference between CR-RLS and RLS is that after updating 
the matrix $\Sigma_t$, the algorithm checks whether $T_0$ (a predefined 
natural number) examples were observed since the last restart, 
and if this is the case, it sets the matrix to be the identity 
matrix. Clearly, if $T_0 = \infty$ the CR-RLS algorithm reduces 
to the RLS algorithm. 
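
TABLE I 

Algorithms for stationary setting 

| | AROWR | AAR | RLS |
| --- | --- | --- | --- |
| Parameters | $0 < r$ | $0 < b$ | $0 < r \le 1$ |
| Initialize | $w_0 = 0$, $\Sigma_0 = I$ | $w_0 = 0$, $\Sigma_0 = b^{-1} I$ | $w_0 = 0$, $\Sigma_0 = I$ |
| For $t = 1 \ldots T$: receive an instance $x_t$ | | | |
| Output prediction | $\hat{y}_t = x_t^\top w_{t-1}$ | $\hat{y}_t = \frac{x_t^\top w_{t-1}}{1 + x_t^\top \Sigma_{t-1} x_t}$ | $\hat{y}_t = x_t^\top w_{t-1}$ |
| Receive a correct label $y_t$ | | | |
| Update $\Sigma_t$ | $\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac{1}{r} x_t x_t^\top$ | $\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + x_t x_t^\top$ | $\Sigma_t^{-1} = r \Sigma_{t-1}^{-1} + x_t x_t^\top$ |
| Update $w_t$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t}$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{1 + x_t^\top \Sigma_{t-1} x_t}$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t}$ |
| Output | $w_T$, $\Sigma_T$ | $w_T$, $\Sigma_T$ | $w_T$, $\Sigma_T$ |
| Extension to non-stationary setting | ARCOR, Sec. IV-A below | LASER, Sec. IV-B below | CR-RLS [10], [20], [31] |
| Analysis | yes, Sec. VI-A below | yes, Sec. VI-B below | No |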



IV. Algorithms for Non-Stationary Regression 



In this work we fill the gap and propose extensions to the 
non-stationary setting for the two other algorithms in Table I. 
Similar to CR-RLS, both algorithms modify the matrix $\Sigma_t$ to 
prevent its eigenvalues from shrinking to zero. The first algorithm, 
described in Sec. IV-A, extends AROWR to the non-stationary 
setting and is similar in spirit to CR-RLS, yet the restart 
operations it performs depend on the spectral properties of the 
covariance matrix, rather than the time index $t$. Additionally, 
this algorithm performs a projection of the weight vector into 
a predefined ball. A similar technique was used in first-order 
algorithms by Herbster and Warmuth [23], and Kivinen and 
Warmuth [25]. Both steps are motivated by the design 
and analysis of AROWR. Its design is composed of solving 
small optimization problems defined in (5), one per input 
example. The non-stationary version performs explicit corrections 
to its update, in order to prevent the covariance 
matrix from shrinking to zero and the weight-vector from growing too 
fast. 

The second algorithm, described in Sec. IV-B, is based on 
a last-step min-max prediction principle and objective, where 
we replace $L_T(u)$ in (4) with $L_T(\{u_t\})$ and make some additional 
modifications that prevent the solution from being degenerate. Here 
the algorithmic modifications from the original AAR algorithm 
are implicit and are due to the modifications of the objective. 
The resulting algorithm smoothly interpolates the covariance 
matrix with a unit matrix. 



A. ARCOR: Adaptive regularization of weights for Regression 
with COvariance Reset 

Our first algorithm is based on AROWR. We propose 
two modifications to (6) and (7), which in combination overcome 
the problem that the algorithm's learning rate gradually 
goes to zero. The modified algorithm operates on segments of 
the input sequence. In each segment, indexed by $i$, the algorithm 
checks whether the lowest eigenvalue of $\Sigma_t$ is greater 
than a given lower bound $\Lambda_i$. Once the lowest eigenvalue of 
$\Sigma_t$ is smaller than $\Lambda_i$, the algorithm resets $\Sigma_t = I$ and moves to 
the next lower bound $\Lambda_{i+1}$. Formally, the algorithm 
uses the update (7) to compute an intermediate candidate for 
$\Sigma_t$, denoted by 

$$\tilde{\Sigma}_t = \left( \Sigma_{t-1}^{-1} + \frac{1}{r}\, x_t x_t^\top \right)^{-1} . \tag{9}$$

If indeed $\tilde{\Sigma}_t \succeq \Lambda_i I$ then it sets $\Sigma_t = \tilde{\Sigma}_t$; otherwise it sets 
$\Sigma_t = I$ and the segment index is increased by 1. 
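
The covariance-reset rule built on (9) can be sketched as follows. This is an illustrative reading, not the authors' code: the name `arcor_cov_step`, the 0-based segment index, the reuse of the last threshold once the list is exhausted, and the toy thresholds are all assumptions. The test $\tilde{\Sigma}_t \succeq \Lambda_i I$ is implemented by comparing the smallest eigenvalue with $\Lambda_i$.

```python
import numpy as np

def arcor_cov_step(Sigma, x, r, lambdas, i):
    """Candidate update (9), followed by the reset test against lambdas[i]."""
    Sx = Sigma @ x
    # Sherman-Morrison form of (Sigma^{-1} + x x^T / r)^{-1}
    Sigma_tilde = Sigma - np.outer(Sx, Sx) / (r + x @ Sx)
    lam = lambdas[min(i, len(lambdas) - 1)]  # assumption: reuse last threshold
    if np.linalg.eigvalsh(Sigma_tilde)[0] >= lam:
        return Sigma_tilde, i                # keep the candidate
    return np.eye(len(x)), i + 1             # reset and start a new segment

# Toy run (assumed values): the same input repeated, so one direction shrinks.
Sigma, i = np.eye(2), 0
lambdas = [0.3, 0.1]
x = np.array([1.0, 0.0])
reset_rounds = []
for t in range(1, 6):
    Sigma, i_new = arcor_cov_step(Sigma, x, 1.0, lambdas, i)
    if i_new != i:
        reset_rounds.append(t)
    i = i_new
```

In this toy run the smallest eigenvalue of the candidate is 1/2, 1/3, 1/4 on the first three rounds, so the first reset happens on round 3, after which the matrix restarts from the identity.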

Additionally, before our modification, the norm of the 
weight vector $w_t$ did not increase much, as the "effective" 
learning rate (the matrix $\Sigma_t$) went to zero. After our update, 
as the learning rate is effectively bounded from below, the 
norm of $w_t$ may increase too fast, which in turn will cause a 
low update rate on non-stationary inputs. 

We thus employ an additional modification, which is exploited 
by the analysis. After updating the mean $w_t$ as in (6), 

$$\tilde{w}_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1})\, \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t} , \tag{10}$$

we project it into a ball $B$ around the origin of radius 
$R_B$ using a Mahalanobis distance. Formally, we define the 
function $\mathrm{proj}(\tilde{w}, \Sigma, R_B)$ to be the solution of the following 
optimization problem, 

$$\arg\min_{\|w\| \le R_B} \frac{1}{2} (w - \tilde{w})^\top \Sigma^{-1} (w - \tilde{w}) .$$

We write the Lagrangian, 

$$\mathcal{L} = \frac{1}{2} (w - \tilde{w})^\top \Sigma^{-1} (w - \tilde{w}) + \alpha \left( \frac{1}{2} \|w\|^2 - \frac{1}{2} R_B^2 \right) .$$

Setting the gradient with respect to $w$ to zero we get 
$\Sigma^{-1} (w - \tilde{w}) + \alpha w = 0$. Solving for $w$ we get 

$$w = \left( \alpha I + \Sigma^{-1} \right)^{-1} \Sigma^{-1} \tilde{w} = \left( I + \alpha \Sigma \right)^{-1} \tilde{w} .$$

From the KKT conditions we get that if $\|\tilde{w}\| \le R_B$ then $\alpha = 0$ 
and $w = \tilde{w}$. Otherwise, $\alpha$ is the unique positive scalar 
that satisfies $\| (I + \alpha \Sigma)^{-1} \tilde{w} \| = R_B$. The value of $\alpha$ can 
be found using binary search and an eigen-decomposition of the 
matrix $\Sigma$. We write explicitly $\Sigma = V \Lambda V^\top$ for a diagonal 
matrix $\Lambda$. By denoting $u = V^\top \tilde{w}$ we rewrite the last equation as 
$\| (I + \alpha \Lambda)^{-1} u \| = R_B$. We thus wish to find $\alpha$ such that 
$\sum_{j=1}^{d} \frac{u_j^2}{(1 + \alpha \Lambda_{j,j})^2} = R_B^2$. This can be done using a binary search for 
$\alpha \in [0, a]$ where $a = (\|u\| / R_B - 1) / \lambda_{\min}(\Lambda)$. To summarize, 
the projection step can be performed in time cubic in $d$ and 
logarithmic in $R_B$ and $\Lambda_i$. 
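
The projection procedure described above (eigen-decomposition followed by a binary search over the multiplier) can be sketched as below. This is a minimal illustration assuming a positive definite matrix; the function name `proj`, the tolerance, and the iteration cap are choices made for the example rather than details from the text.

```python
import numpy as np

def proj(w_tilde, Sigma, R_B, tol=1e-12, max_iter=200):
    """Solve min_{||w|| <= R_B} (w - w_tilde)' Sigma^{-1} (w - w_tilde) / 2."""
    if np.linalg.norm(w_tilde) <= R_B:
        return w_tilde.copy()                # constraint inactive, alpha = 0
    lam, V = np.linalg.eigh(Sigma)           # Sigma = V diag(lam) V', lam ascending
    u = V.T @ w_tilde

    def norm_at(alpha):
        # || (I + alpha * Lambda)^{-1} u ||, decreasing in alpha
        return np.linalg.norm(u / (1.0 + alpha * lam))

    lo, hi = 0.0, (np.linalg.norm(u) / R_B - 1.0) / lam[0]
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if norm_at(mid) > R_B:
            lo = mid                         # still outside the ball
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    alpha = 0.5 * (lo + hi)
    return V @ (u / (1.0 + alpha * lam))     # w = (I + alpha * Sigma)^{-1} w_tilde

w_out = proj(np.array([3.0, 4.0]), np.diag([1.0, 0.25]), 1.0)
w_iso = proj(np.array([3.0, 4.0]), np.eye(2), 1.0)
w_in = proj(np.array([0.3, 0.4]), np.eye(2), 1.0)
```

When the matrix is the identity the projection reduces to rescaling, so `w_iso` is approximately (0.6, 0.8); for a general matrix the result lies on the sphere of radius $R_B$ but is not parallel to the input.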

We call the algorithm ARCOR, for adaptive regularization 
with covariance reset. A pseudo-code of the algorithm is 
summarized in the left column of Table II. We defer a 
comparison of ARCOR and CR-RLS until after the presentation 
of our second algorithm. 



B. Last-Step Min-Max Algorithm for Non-stationary Setting 

Our second algorithm is based on a last-step min-max 
predictor proposed by Forster [16] and later modified by 
Moroshko and Crammer [29] to obtain sub-logarithmic regret 
in the stationary case. On each round, the algorithm predicts 
as if it is the last round, and assumes a worst-case choice of $y_t$ 
given the algorithm's prediction. 

We extend the rule given in (4) to the non-stationary setting, 
and re-define the last-step min-max predictor $\hat{y}_T$ to be¹ 

$$\hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \min_{u_1, \ldots, u_T} Q_T(u_1, \ldots, u_T) \right] , \tag{11}$$

where 

$$Q_t(u_1, \ldots, u_t) = b \|u_1\|^2 + c \sum_{s=1}^{t-1} \|u_{s+1} - u_s\|^2 + \sum_{s=1}^{t} \left( y_s - u_s^\top x_s \right)^2 , \tag{12}$$

for some positive constants $b, c$. The first term of (11) is the 
loss suffered by the algorithm, while $Q_t(u_1, \ldots, u_t)$ defined 
in (12) is a sum of the loss suffered by some sequence of linear 
functions $(u_1, \ldots, u_t)$ and a penalty for consecutive pairs that 
are far from each other, and for the norm of the first to be far 
from zero. 

¹ $\hat{y}_T$ and $y_T$ serve both as quantifiers (over the min and max operators, 
respectively), and as the optimal arguments of this optimization problem. 

We develop the algorithm by solving the three optimization 
problems in (11): first, minimizing the inner term, 
$\min_{u_1, \ldots, u_T} Q_T(u_1, \ldots, u_T)$, then maximizing over $y_T$, and finally, 
minimizing over $\hat{y}_T$. We start with the inner term, for which 
we define an auxiliary function, 

$$P_t(u_t) = \min_{u_1, \ldots, u_{t-1}} Q_t(u_1, \ldots, u_t) ,$$

which clearly satisfies 

$$\min_{u_1, \ldots, u_t} Q_t(u_1, \ldots, u_t) = \min_{u_t} P_t(u_t) .$$

The following lemma states a recursive form of the 
function-sequence $P_t(u_t)$. 

Lemma 1: For $t = 2, 3, \ldots$ 

$$P_t(u_t) = \min_{u_{t-1}} \left( P_{t-1}(u_{t-1}) + c \|u_t - u_{t-1}\|^2 + \left( y_t - u_t^\top x_t \right)^2 \right) .$$

The proof appears in Sec. A. Using Lemma 1 we write 
explicitly the function $P_t(u_t)$. 

Lemma 2: The following equality holds 

$$P_t(u_t) = u_t^\top D_t u_t - 2 u_t^\top e_t + f_t \tag{13}$$

where, 



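$$D_1 = bI + x_1 x_1^\top , \quad D_t = \left( D_{t-1}^{-1} + c^{-1} I \right)^{-1} + x_t x_t^\top \tag{14}$$

$$e_1 = y_1 x_1 , \quad e_t = \left( I + c^{-1} D_{t-1} \right)^{-1} e_{t-1} + y_t x_t \tag{15}$$

$$f_1 = y_1^2 , \quad f_t = f_{t-1} - e_{t-1}^\top \left( cI + D_{t-1} \right)^{-1} e_{t-1} + y_t^2 . \tag{16}$$

Note that $D_t \in \mathbb{R}^{d \times d}$ is a positive definite matrix, $e_t \in \mathbb{R}^{d \times 1}$ 
and $f_t \in \mathbb{R}$. The proof appears in Sec. B. From Lemma 2 we 
conclude that 

$$\min_{u_1, \ldots, u_t} Q_t(u_1, \ldots, u_t) = \min_{u_t} P_t(u_t) = \min_{u_t} \left( u_t^\top D_t u_t - 2 u_t^\top e_t + f_t \right) = -e_t^\top D_t^{-1} e_t + f_t . \tag{17}$$

Substituting (17) back in (11) we get that the last-step min-max 
predictor is given by 

$$\hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 + e_T^\top D_T^{-1} e_T - f_T \right] . \tag{18}$$

Since $e_T$ depends on $y_T$ we substitute (15) in the second term 
of (18), 

$$e_T^\top D_T^{-1} e_T = \left( \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} + y_T x_T \right)^\top D_T^{-1} \left( \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} + y_T x_T \right) . \tag{19}$$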
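
TABLE II 

ARCOR, LASER and CR-RLS algorithms 

| | ARCOR | LASER | CR-RLS |
| --- | --- | --- | --- |
| Parameters | $0 < r$, $R_B$, a sequence $1 > \Lambda_1 \ge \Lambda_2 \ge \ldots$ | $0 < b < c$ | $0 < r \le 1$, $T_0 \in \mathbb{N}$ |
| Initialize | $w_0 = 0$, $\Sigma_0 = I$, $i = 1$ | $w_0 = 0$, $\Sigma_0 = \frac{c - b}{bc} I$ | $w_0 = 0$, $\Sigma_0 = I$ |
| For $t = 1 \ldots T$: receive an instance $x_t$ | | | |
| Output prediction | $\hat{y}_t = x_t^\top w_{t-1}$ | $\hat{y}_t = \frac{x_t^\top w_{t-1}}{1 + x_t^\top (\Sigma_{t-1} + c^{-1} I) x_t}$ | $\hat{y}_t = x_t^\top w_{t-1}$ |
| Receive a correct label $y_t$ | | | |
| Update $\Sigma_t$ | $\tilde{\Sigma}_t^{-1} = \Sigma_{t-1}^{-1} + \frac{1}{r} x_t x_t^\top$; if $\tilde{\Sigma}_t \succeq \Lambda_i I$ set $\Sigma_t = \tilde{\Sigma}_t$, else set $\Sigma_t = I$, $i = i + 1$ | $\Sigma_t^{-1} = (\Sigma_{t-1} + c^{-1} I)^{-1} + x_t x_t^\top$ | $\tilde{\Sigma}_t^{-1} = r \Sigma_{t-1}^{-1} + x_t x_t^\top$; if $\mathrm{mod}(t, T_0) > 0$ set $\Sigma_t = \tilde{\Sigma}_t$, else set $\Sigma_t = I$ |
| Update $w_t$ | $\tilde{w}_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t}$; $w_t = \mathrm{proj}(\tilde{w}_t, \Sigma_t, R_B)$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) (\Sigma_{t-1} + c^{-1} I) x_t}{1 + x_t^\top (\Sigma_{t-1} + c^{-1} I) x_t}$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t}$ |
| Output | $w_T$, $\Sigma_T$ | $w_T$, $\Sigma_T$ | $w_T$, $\Sigma_T$ |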



Substituting (19) and (16) in (18) and omitting terms not 
depending explicitly on $y_T$ and $\hat{y}_T$ we get 

$$\hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ (y_T - \hat{y}_T)^2 + y_T^2\, x_T^\top D_T^{-1} x_T + 2 y_T\, x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} - y_T^2 \right]$$

$$= \arg\min_{\hat{y}_T} \max_{y_T} \left[ \left( x_T^\top D_T^{-1} x_T \right) y_T^2 + 2 y_T \left( x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} - \hat{y}_T \right) + \hat{y}_T^2 \right] . \tag{20}$$

The last expression is strictly convex in $y_T$ and thus the optimal 
solution is not bounded. To solve it, we follow an approach 
used by Forster in a different context [16]. In order to make 
the optimal value bounded, we assume that the adversary can 
only choose labels from a bounded set $y_T \in [-Y, Y]$. Thus, 
the optimal solution of (20) over $y_T$ is given by the following 
equation, since the optimal value is $y_T \in \{+Y, -Y\}$, 

$$\hat{y}_T = \arg\min_{\hat{y}_T} \left[ \left( x_T^\top D_T^{-1} x_T \right) Y^2 + 2Y \left| x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} - \hat{y}_T \right| + \hat{y}_T^2 \right] .$$

This problem is of a similar form to the one discussed by 
Forster [16], from which we get the optimal solution, 

$$\hat{y}_T = \mathrm{clip}\left( x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1},\; Y \right) .$$

The optimal solution depends explicitly on the bound $Y$; 
as its value is not known, we ignore it and define 
the output of the algorithm to be 

$$\hat{y}_T = x_T^\top D_T^{-1} \left( I + c^{-1} D_{T-1} \right)^{-1} e_{T-1} = x_T^\top D_T^{-1} D'_{T-1} e_{T-1} , \tag{21}$$

where we define 

$$D'_{t-1} = \left( I + c^{-1} D_{t-1} \right)^{-1} . \tag{22}$$

We call the algorithm LASER, for last step adaptive regressor 
algorithm. Clearly, for $c = \infty$ the LASER algorithm reduces 
to the AAR algorithm. Similar to CR-RLS and ARCOR, this 
algorithm can also be expressed in terms of a weight-vector $w_t$ 
and a PSD matrix $\Sigma_t$, by denoting $w_t = D_t^{-1} e_t$ and $\Sigma_t = 
D_t^{-1}$. The algorithm is summarized in the middle column of 
Table II. 
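
The equivalence between the $(D_t, e_t)$ recursion in (14) and (15) and the weight-vector form can be checked numerically. The sketch below is an illustration under stated assumptions: the function name `laser_step`, the random data, and the values b = 1, c = 10 are hypothetical, and the initialization $\Sigma_0 = \frac{c-b}{bc} I$ is chosen here so that the first step reproduces $D_1 = bI + x_1 x_1^\top$.

```python
import numpy as np

def laser_step(w, Sigma, x, y, c):
    """One LASER round in (w, Sigma) form, with A = Sigma_{t-1} + I / c."""
    A = Sigma + np.eye(len(x)) / c
    Ax = A @ x
    denom = 1.0 + x @ Ax
    y_hat = (x @ w) / denom                      # prediction, as in (21)
    w_new = w + (y - x @ w) * Ax / denom
    Sigma_new = A - np.outer(Ax, Ax) / denom     # (A^{-1} + x x^T)^{-1}
    return y_hat, w_new, Sigma_new

rng = np.random.default_rng(3)
d, T, b, c = 3, 30, 1.0, 10.0
X, Y = rng.normal(size=(T, d)), rng.normal(size=T)

w, Sigma = np.zeros(d), (c - b) / (b * c) * np.eye(d)   # assumed initialization
D, e = None, None
for t in range(T):
    x, y = X[t], Y[t]
    _, w, Sigma = laser_step(w, Sigma, x, y, c)
    if t == 0:
        D, e = b * np.eye(d) + np.outer(x, x), y * x      # D_1, e_1
    else:
        D_prev = D
        e = np.linalg.solve(np.eye(d) + D_prev / c, e) + y * x               # (15)
        D = np.linalg.inv(np.linalg.inv(D_prev) + np.eye(d) / c) + np.outer(x, x)  # (14)
```

After the loop, `w` should agree with `np.linalg.solve(D, e)` and `Sigma` with the inverse of `D`, which is the change of variables stated in the text.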

C. Discussion 

Table II enables us to compare the three algorithms 
head-to-head. All algorithms perform linear predictions, and then 
update the prediction vector $w_t$ and the matrix $\Sigma_t$. CR-RLS 
and ARCOR are more similar to each other: both stem from 
a stationary algorithm, and perform resets from time to time. 
For CR-RLS a reset is performed every fixed number of time steps, while for 
ARCOR it is performed when the eigenvalues of the matrix (or 
effective learning rate) are too small. ARCOR also performs a 
projection step, which ensures that the weight-vector 
will not grow too much, and is used explicitly in the 
analysis below. Note that CR-RLS (as well as RLS) also uses 
a forgetting factor (if $r < 1$). 

Our second algorithm, LASER, controls the covariance 
matrix in a smoother way. On each iteration it interpolates 
it with the identity matrix before adding $x_t x_t^\top$. Note that if 
$\lambda$ is an eigenvalue of $\Sigma_{t-1}^{-1}$ then $\lambda \times (c / (\lambda + c)) < \lambda$ is an 
eigenvalue of $(\Sigma_{t-1} + c^{-1} I)^{-1}$. Thus the algorithm implicitly 
reduces the eigenvalues of the inverse covariance (and increases 
the eigenvalues of the covariance). 
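
The eigenvalue claim can be verified numerically on a random positive definite matrix. This is only a sanity check under assumed values (the dimension, the seed, and c = 2 are arbitrary choices for the example).

```python
import numpy as np

rng = np.random.default_rng(4)
d, c = 4, 2.0
M = rng.normal(size=(d, d))
Sigma_inv = M @ M.T + np.eye(d)            # a symmetric positive definite matrix
lam = np.linalg.eigvalsh(Sigma_inv)        # eigenvalues of Sigma^{-1}, ascending

Sigma = np.linalg.inv(Sigma_inv)
mixed_inv = np.linalg.inv(Sigma + np.eye(d) / c)   # (Sigma + I / c)^{-1}
lam_mixed = np.linalg.eigvalsh(mixed_inv)          # ascending as well

predicted = lam * c / (lam + c)            # the map is increasing, so order is kept
```

Since the two matrices share eigenvectors, each eigenvalue $\lambda$ is mapped to $\lambda c / (\lambda + c)$, which is strictly smaller than both $\lambda$ and $c$.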

Finally, all three algorithms can be combined with Mercer 
kernels as they employ only sums of inner- and outer-products 
of their inputs. This allows them to perform non-linear 
predictions, similar to SVM. 

V. Related Work 

There is a large body of research in online learning for 
regression problems. Almost half a century ago, Widrow and 
Hoff 1 37 1 developed a variant of the least mean squares 
(LMS) algorithm for adaptive filtering and noise reduction. 
The algorithm was further developed and analyzed extensively 
(for example by Feuer 1151). The normalized least mean 
squares filter (NLMS) ||3)71]4] builds on LMS and performs 
better to scaling of the input. The recursive least squares 
(RLS) ||22) is the closest to our algorithms in the signal 
processing literature and also maintains a weight-vector and a 
covariance-like matrix, which is positive semi-definite (PSD), 
that is used to re-weight inputs. 

In the machine learning literature the problem of online 
regression was studied extensively, and clearly we cannot 
cover all the relevant work. Cesa-Bianchi et al. [7j studied 
gradient descent based algorithms for regression with the 
squared loss. Kivinen and Warmuth f25^ proposed various 
generalizations for general regularization functions. We refer 
the reader to a comprehensive book in the subject |9|. 

Foster fVl] studied an online version of the ridge regression 
algorithm in the worst-case setting. Vovk [21 1 proposed a 
related algorithm called the Aggregating Algorithm (AA), 
which was later applied to the problem of linear regression 
with square loss |36|. Forster 1 16] simplified the regret analysis 
for this problem. Both algorithms employ second order infor- 
mation. ARCOR for the separable case is very similar to these 
algorithms, although has alternative derivation. Recently, few 
algorithms were proposed either for classification |[8), |[TT|- 
p3] or for general loss functions fT4|, p8) in the online 
convex programming framework. AROWR J35[ shares the 
same design principles of AROW p2) yet it is aimed for 
regression. The ARCOR algorithm takes AROWR one step 
further and it has two important modifications which makes it 
work in the drifting or shifting settings. These modifications 
make the analysis more complex than of AROW. 

Two of the approaches used in previous algorithms for 
non-stationary setting are to bound the weight vector and 
covariance reset. Bounding the weight vector was performed 
either by projecting it into a bounded set ([23), shrinking 
it by multiplication |26|, or subtraction of previously seen 
examples |i6J. These three methods (or at least most of their 
variants) can be combined with kernel operators, and in fact, 
the last two approaches were designed and motivated by 
kernels. 

The Covariance Reset RLS algorithm (CR-RLS) |[T0|, pO), 
pT| was designed for adaptive filtering. CR-RLS makes 



covariance reset every fixed amount of data points, while 
ARCOR performs restarts based on the actual properties 
of the data - the eigenspectrum of the covariance matrix. 
Furthermore, as far as we know, there is no analysis in the 
mistake bound model for this algorithm. Both ARCOR and 
CR-RLS are motivated from the property that the covariance 
matrix goes to zero and becomes rank deficient. In both algo- 
rithms the information encapsulated in the covariance matrix 
is lost after restarts. In a rapidly varying environments, like 
a wireless channel, this loss of memory can be beneficial, as 
previous contributions to the covariance matrix may have little 
correlation with the current structure. Recent versions of CR- 



RLS p9) , 1 33 1 employ covariance reset to have numerically 
stable computations. 

The ARCOR algorithm combines both techniques with online
learning that employs a second-order algorithm for regression. In
this respect we have the best of all worlds: a fast convergence rate
due to the usage of second-order information, and the ability
to adapt in non-stationary environments due to projection and
resets.

LASER is simpler than all these algorithms as it controls the
increase of the eigenvalues of the covariance matrix implicitly
rather than explicitly, by "averaging" it with a fixed diagonal
matrix (see (14)), and it does not involve projection steps. The
Kalman filter [24] and the H∞ algorithm (e.g. the work of
Simon [32]), designed for filtering, take a similar approach, yet
the exact algebraic form is different.

The derivation of the LASER algorithm in this work shares
similarities with the work of Forster [16] and the work of
Moroshko and Crammer [29]. These algorithms are motivated
by the last-step min-max predictor. Yet, the algorithms of
Forster and of Moroshko and Crammer are designed for the
stationary setting, while LASER is primarily designed for
the non-stationary setting. Moroshko and Crammer [29] also
discussed a weak variant of the non-stationary setting, where
the complexity is measured by the total distance from a
reference vector u, rather than the total distance of consecutive
vectors (as in this paper), which is more relevant to
non-stationary problems.

VI. Regret bounds 

We now analyze our algorithms in the non-stationary case,
upper bounding the regret using more than a single comparison
vector. Specifically, our goal is to prove bounds that
hold uniformly for all inputs, and are of the form

\[ L_T(\mathrm{alg}) \le L_T(\{u_t\}) + \alpha(T)\,\bigl(V^{(P)}\bigr)^{\gamma} + \beta(T) , \]

for either $P = 1$ or $P = 2$, a constant $\gamma$ and some functions
$\alpha(T)$, $\beta(T)$ that may depend implicitly on other quantities of
the problem.

Specifically, in the next section we show that under a
particular choice of $\Lambda_i = \Lambda_i(V^{(1)})$ for the ARCOR algorithm,
its regret is bounded by

\[ L_T(\mathrm{ARCOR}) \le L_T(\{u_t\}) + O\bigl(T^{1/2}\,(V^{(1)})^{1/2}\,\log T\bigr) . \]

Additionally, in Sec. VI-B we show that under a proper choice
of the constant $c = c(V^{(2)})$, the regret of LASER is bounded by

\[ L_T(\mathrm{LASER}) \le L_T(\{u_t\}) + O\bigl(T^{2/3}\,(V^{(2)})^{1/3}\bigr) . \]

The two bounds are not comparable in general. For example,
assume a constant instantaneous drift $\|u_{t+1} - u_t\| = \nu$
for some constant value $\nu$. In this case the variance and
squared variance are $V^{(1)} = T\nu$ and $V^{(2)} = T\nu^2$. The
bound of ARCOR becomes $\nu^{1/2}\, T \log T$, while the bound of
LASER becomes $\nu^{2/3}\, T$. The bound of ARCOR is larger if
$(\log T)^6 > \nu$, and the bound of LASER is larger in the
opposite case.
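
As a small numerical illustration of this comparison, the sketch below evaluates the leading terms of the two bounds under constant drift, ignoring the constants hidden in the O(.) notation. The function names are hypothetical and are not part of the paper.

```python
import math

def arcor_rate(T, nu):
    """Leading term of the ARCOR bound under constant drift: nu^(1/2) * T * log T."""
    return math.sqrt(nu) * T * math.log(T)

def laser_rate(T, nu):
    """Leading term of the LASER bound under constant drift: nu^(2/3) * T."""
    return nu ** (2.0 / 3.0) * T

def arcor_bound_is_larger(T, nu):
    """True when the ARCOR leading term exceeds the LASER leading term."""
    return arcor_rate(T, nu) > laser_rate(T, nu)

if __name__ == "__main__":
    T = 10_000
    for nu in (1e-3, 1.0, 1e6):
        # Equivalent test from the text: (log T)^6 > nu
        print(nu, arcor_bound_is_larger(T, nu), math.log(T) ** 6 > nu)
```

Dividing both leading terms by $\nu^{1/2} T$ shows that the comparison reduces to $\log T$ versus $\nu^{1/6}$, which is where the exponent 6 comes from.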

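Another example is polynomial decay of the drift, $\|u_{t+1} - u_t\| \le t^{-\kappa}$ for some $\kappa > 0$. In this case, for $\kappa \ne 1$ we get

\[ V^{(1)} \le \sum_{t=1}^{T-1} t^{-\kappa} \le \int_{1}^{T-1} t^{-\kappa}\,dt + 1 = \frac{(T-1)^{1-\kappa} - \kappa}{1-\kappa} . \]

For $\kappa = 1$ we get $V^{(1)} \le \log(T-1) + 1$. For LASER we have, for $\kappa \ne 0.5$,

\[ V^{(2)} \le \sum_{t=1}^{T-1} t^{-2\kappa} \le \int_{1}^{T-1} t^{-2\kappa}\,dt + 1 = \frac{(T-1)^{1-2\kappa} - 2\kappa}{1-2\kappa} . \]

For $\kappa = 0.5$ we get $V^{(2)} \le \log(T-1) + 1$. Asymptotically,
ARCOR outperforms LASER about when $\kappa > 0.7$.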
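
The integral bounds on the drift sums can be checked numerically. The following sketch, with hypothetical helper names, computes the worst-case sums $\sum_t t^{-\kappa}$ and $\sum_t t^{-2\kappa}$ directly and compares them with the closed forms obtained from $1 + \int_1^{T-1} t^{-a}\,dt$.

```python
import math

def drift_sums(T, kappa):
    """Worst-case V1 and V2 when the t-th drift equals t^(-kappa), t = 1..T-1."""
    v1 = sum(t ** -kappa for t in range(1, T))
    v2 = sum(t ** (-2 * kappa) for t in range(1, T))
    return v1, v2

def integral_bound(T, a):
    """Upper bound on sum_{t=1}^{T-1} t^(-a), namely 1 + integral_1^{T-1} t^(-a) dt."""
    if a == 1:
        return math.log(T - 1) + 1
    return ((T - 1) ** (1 - a) - a) / (1 - a)

if __name__ == "__main__":
    T = 1000
    for kappa in (0.3, 0.5, 0.7, 1.0, 1.5):
        v1, v2 = drift_sums(T, kappa)
        print(kappa, v1, integral_bound(T, kappa), v2, integral_bound(T, 2 * kappa))
```

The bound holds because $t^{-a}$ is decreasing for $a > 0$, so each summand with $t \ge 2$ is at most the integral of $t^{-a}$ over $[t-1, t]$.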

Herbster and Warmuth [23] developed shifting bounds for
general gradient descent algorithms with projection of the
weight vector using the Bregman divergence. In their bounds,
there is a factor greater than 1 multiplying the term $L_T(\{u_t\})$,
leading to a small regret only when the data is close to being
realizable with linear models. Busuttil and Kalnishkan [5]
developed a variant of the Aggregating Algorithm |21 1 for the 
non-stationary setting. However, to have sublinear regret they
require a strong assumption on the drift, $V^{(2)} = o(1)$, while
we require only $V^{(2)} = o(T)$ (for LASER) or $V^{(1)} = o(T)$
(for ARCOR).



A. Analysis of ARCOR algorithm

Let us define additional notation that we will use in our bounds. We denote by $t_i$ the example index for which a restart was performed for the $i$th time, that is $\Sigma_{t_i} = I$ for all $i$. We define by $n$ the total number of restarts, or intervals. We denote by $T_i = t_i - t_{i-1}$ the number of examples between two consecutive restarts. Clearly $T = \sum_{i=1}^{n} T_i$. Finally, we denote by $\Sigma^{i-1} = \Sigma_{t_i - 1}$ the covariance matrix just before the $i$th restart, and we note that it depends on exactly $T_i$ examples (since the last restart).

In what follows we compare the performance of the ARCOR algorithm to the performance of a sequence of weight vectors $u_t \in \mathbb{R}^d$, all of which are of bounded norm $R_B$. In other words, all the vectors $u_t$ belong to $B$. We break the proof into four steps. In the first step (Theorem 3) we bound the regret when the algorithm is executed with some value of parameters $\{\Lambda_i\}$ and the resulting covariance matrices. In the second step, summarized in Corollary 4, we remove the dependencies on the covariance matrices by taking a worst-case bound. In the third step, summarized in Lemma 5, we upper bound the total number of switches $n$ given the parameters $\{\Lambda_i\}$. Finally, in Corollary 6 we provide the regret bound for a specific choice of the parameters. We now move to state the first theorem.

Theorem 3: Assume that the ARCOR algorithm is run with an input sequence $(x_1, y_1), \ldots, (x_T, y_T)$. Assume that all the inputs are upper bounded by unit norm, $\|x_t\| \le 1$, and that the outputs are bounded by $Y = \max_t |y_t|$. Let $u_t$ be any sequence of bounded weight vectors, $\|u_t\| \le R_B$. Then, the cumulative loss is bounded by

\[ L_T(\mathrm{ARCOR}) \le L_T(\{u_t\}) + 2 R_B\, r \sum_{t} \frac{1}{\Lambda_{i(t)}} \|u_{t-1} - u_t\| + r\, u_T^\top \Sigma_T^{-1} u_T + 2\,(R_B^2 + Y^2) \sum_{i=1}^{n} \log\det\bigl((\Sigma^{i})^{-1}\bigr) , \]

where $n$ is the number of covariance restarts, $i(t)$ is the number of restarts that occurred before observing the $t$th example, and $\Sigma^{i-1}$ is the value of the covariance matrix just before the $i$th restart.

The proof appears in Sec. C. Note that the number of restarts $n$ is not fixed but depends both on the total number of examples $T$ and the scheme used to set the values of the lower bound of the eigenvalues $\Lambda_i$. In general, the lower the values of $\Lambda_i$ are, the smaller the number of covariance restarts that occur, yet the larger the value of the last term of the bound is, which scales inversely proportional to $\Lambda_i$. A more precise statement is given in the next corollary.

Corollary 4: Assume that the ARCOR algorithm made $n$ restarts. Under the conditions of Theorem 3 we have,

\[ L_T(\mathrm{ARCOR}) \le L_T(\{u_t\}) + 2 R_B\, r\, \Lambda_n^{-1} \sum_{t} \|u_{t-1} - u_t\| + r\, u_T^\top \Sigma_T^{-1} u_T + 2\,(R_B^2 + Y^2)\, d\, n \log\Bigl(1 + \frac{T}{n r d}\Bigr) . \]

Proof: By definition we have

\[ (\Sigma^{i})^{-1} = I + \frac{1}{r} \sum_{t=t_i}^{t_i + T_i} x_t x_t^\top . \]

Denote the eigenvalues of $\sum_{t=t_i}^{t_i+T_i} x_t x_t^\top$ by $\lambda_1, \ldots, \lambda_d$. Since $\|x_t\| \le 1$ their sum is $\mathrm{Tr}\bigl(\sum_{t=t_i}^{t_i+T_i} x_t x_t^\top\bigr) \le T_i$. We use the concavity of the log function to bound

\[ \log\det\bigl((\Sigma^{i})^{-1}\bigr) = \sum_{j=1}^{d} \log\Bigl(1 + \frac{\lambda_j}{r}\Bigr) \le d \log\Bigl(1 + \frac{T_i}{r d}\Bigr) . \]

We use concavity again to bound the sum

\[ \sum_{i=1}^{n} \log\det\bigl((\Sigma^{i})^{-1}\bigr) \le \sum_{i=1}^{n} d \log\Bigl(1 + \frac{T_i}{r d}\Bigr) \le d\, n \log\Bigl(1 + \frac{T}{n r d}\Bigr) , \]

where we used the fact that $\sum_{i=1}^{n} T_i = T$. Substituting the last inequality in Theorem 3, as well as using the monotonicity of the coefficients, $\Lambda_i \ge \Lambda_n$ for all $i \le n$, yields the desired bound. ■
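
The two concavity steps used in the proof of Corollary 4 are instances of Jensen's inequality for the logarithm. The sketch below, with hypothetical helper names and illustrative numbers, checks both steps numerically for a given list of eigenvalues and a given list of segment lengths.

```python
import math

def logdet_from_eigs(eigs, r):
    """log det(I + A / r) for a PSD matrix A with the given eigenvalues."""
    return sum(math.log(1.0 + lam / r) for lam in eigs)

def per_segment_bound(T_i, d, r):
    """First concavity step: d * log(1 + T_i / (r d)), valid when sum(eigs) <= T_i."""
    return d * math.log(1.0 + T_i / (r * d))

def all_segments_bound(T, n, d, r):
    """Second concavity step: d * n * log(1 + T / (n r d)), with T the total length."""
    return d * n * math.log(1.0 + T / (n * r * d))

if __name__ == "__main__":
    r, d = 2.0, 4
    eigs = [3.0, 1.0, 0.5, 0.0]          # trace 4.5, so any T_i >= 4.5 works
    print(logdet_from_eigs(eigs, r), per_segment_bound(4.5, d, r))
    lengths = [10, 30, 60]
    lhs = sum(per_segment_bound(Ti, d, r) for Ti in lengths)
    print(lhs, all_segments_bound(sum(lengths), len(lengths), d, r))
```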
Implicitly, the drift term and the logarithmic term of the bound have
opposite dependence on $n$. The drift term is decreasing
with $n$: if $n$ is small it means that the lower bound $\Lambda_n$ is
very low (otherwise we would make many restarts) and thus
$\Lambda_n^{-1}$ is large. The logarithmic term is increasing with $n \le T$. We
now make this implicit dependence explicit.

Our goal is to bound the number of restarts $n$ as a function
of the number of examples $T$. This depends on the exact
sequence of values $\Lambda_i$ used. The following lemma provides
a bound on $n$ given a specific sequence of $\Lambda_i$.

Lemma 5: Assume that the ARCOR algorithm is run with some sequence of $\Lambda_i$. Then, the number of restarts is upper bounded by

\[ n \le \max_{N} \Bigl\{ N \;:\; T \ge r \sum_{i=1}^{N} \bigl(\Lambda_i^{-1} - 1\bigr) \Bigr\} . \]

Proof: Since $\sum_{i=1}^{n} T_i = T$, the number of restarts is maximized when the number of examples between restarts $T_i$ is minimized. We prove now a lower bound on $T_i$ for all $i = 1, \ldots, n$. A restart occurs for the $i$th time when the smallest eigenvalue of $\Sigma_t$ is smaller (for the first time) than $\Lambda_i$.

As before, by definition, $(\Sigma^{i})^{-1} = I + \frac{1}{r} \sum_{t=t_i}^{t_i+T_i} x_t x_t^\top$. By a result in matrix analysis [18, Theorem 8.1.8] we have that there exists a matrix $A \in \mathbb{R}^{d \times T_i}$ with each column belonging to a bounded convex body that satisfies $a_{k,l} \ge 0$ and $\sum_k a_{k,l} \le 1$ for $l = 1, \ldots, T_i$, such that the $k$th eigenvalue $\lambda_k^i$ of $(\Sigma^{i})^{-1}$ equals $\lambda_k^i = 1 + \frac{1}{r} \sum_{l=1}^{T_i} a_{k,l}$. The value of $T_i$ is defined as when the largest eigenvalue of $(\Sigma^{i})^{-1}$ hits $\Lambda_i^{-1}$. Formally, we get the following lower bound on $T_i$,

\[ \arg\min_{\{a_{k,l}\}} s \quad \text{s.t.} \quad \max_k \Bigl(1 + \frac{1}{r} \sum_{l=1}^{s} a_{k,l}\Bigr) \ge \Lambda_i^{-1} , \]
\[ a_{k,l} \ge 0 \ \text{ for } k = 1, \ldots, d,\ l = 1, \ldots, s , \qquad \sum_{k} a_{k,l} \le 1 \ \text{ for } l = 1, \ldots, s . \]

For a fixed value of $s$, a maximal value of $\max_k \bigl(1 + \frac{1}{r} \sum_{l=1}^{s} a_{k,l}\bigr)$ is obtained if all the "mass" is concentrated in one value $k$ and for this $k$ each $a_{k,l}$ is equal to its maximal value 1. That is, we have $a_{k,l} = 1$ for $k = k_0$ and $a_{k,l} = 0$ otherwise. In this case $\max_k \bigl(1 + \frac{1}{r} \sum_{l=1}^{s} a_{k,l}\bigr) = 1 + \frac{s}{r}$, and the lower bound is obtained when $1 + \frac{s}{r} = \Lambda_i^{-1}$. Solving for $s$ we get that the shortest possible length of the $i$th interval is bounded by $T_i \ge r\,(\Lambda_i^{-1} - 1)$. Summing over the last equation we get $T = \sum_{i=1}^{n} T_i \ge r \sum_{i=1}^{n} (\Lambda_i^{-1} - 1)$. Thus, the number of restarts is upper bounded by the maximal value $n$ that satisfies the last inequality. ■
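
The bound of Lemma 5 is easy to evaluate for a concrete threshold sequence. The sketch below is a minimal illustration with hypothetical function names: it accumulates $r(\Lambda_i^{-1} - 1)$ until the budget $T$ is exceeded. It assumes $\Lambda_i < 1$, so that every summand is positive and the loop terminates.

```python
import itertools

def max_restarts(T, r, inv_thresholds):
    """Largest N such that T >= r * sum_{i=1}^{N} (1/Lambda_i - 1).

    inv_thresholds yields 1/Lambda_1, 1/Lambda_2, ... (each assumed > 1).
    """
    total = 0.0
    n = 0
    for inv_lam in inv_thresholds:
        total += r * (inv_lam - 1.0)
        if total > T:
            break
        n += 1
    return n

def polynomial_inverse_thresholds(q):
    """Polynomial schema 1/Lambda_i = i^(q-1) + 1, for i = 1, 2, ..."""
    return (i ** (q - 1) + 1 for i in itertools.count(1))

if __name__ == "__main__":
    # With q = 2 and r = 1 the condition reads 1 + 2 + ... + N <= T.
    print(max_restarts(100, 1.0, polynomial_inverse_thresholds(2.0)))
```

Passing the thresholds as an iterator keeps the helper independent of the particular schema; any decreasing sequence $\Lambda_i$ can be plugged in.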

We now prove a bound for a specific choice of the parameters
$\{\Lambda_i\}$, namely polynomial decay, $\Lambda_i^{-1} = i^{q-1} + 1$.
This schema to set $\{\Lambda_i\}$ balances between the amount of
noise (need for many restarts) and the property that using the
covariance matrix for updates achieves fast convergence. We
note that an exponential schema $\Lambda_i = 2^{-i}$ will lead to very few
restarts, and very small eigenvalues of the covariance matrix.
Intuitively, this is because the last segment will be about half
the length of the entire sequence. Combining Lemma 5 with
Corollary 4 we get the following.

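Corollary 6: Assume that the ARCOR algorithm is run with a polynomial schema, that is $\Lambda_i^{-1} = i^{q-1} + 1$ for some $q \ne 0$. Under the conditions of Theorem 3 we have,

\[ L_T(\mathrm{ARCOR}) \le L_T(\{u_t\}) + r\, u_T^\top \Sigma_T^{-1} u_T \]
\[ \qquad + 2\,(R_B^2 + Y^2)\, d\, (qT+1)^{1/q} \log\Bigl(1 + \frac{T}{r d\, (qT+1)^{1/q}}\Bigr) \tag{23} \]
\[ \qquad + 2 R_B\, r \bigl((qT+1)^{(q-1)/q} + 1\bigr) \sum_{t} \|u_{t-1} - u_t\| . \tag{24} \]

Proof: Substituting $\Lambda_i^{-1} = i^{q-1} + 1$ in Lemma 5 we get,

\[ T \ge r \sum_{i=1}^{n} \bigl(\Lambda_i^{-1} - 1\bigr) = r \sum_{i=1}^{n} i^{q-1} \ge r \int_{1}^{n} x^{q-1}\,dx = \frac{r}{q}\,(n^q - 1) . \]

This yields (taking $r \ge 1$) an upper bound on $n$,

\[ n \le (qT+1)^{1/q} \quad\Rightarrow\quad \Lambda_n^{-1} \le (qT+1)^{(q-1)/q} + 1 . \]
■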

Comparing the last two terms of the bound of Corollary 6 we
observe a natural tradeoff in the value of $q$. The third term,
(23), is decreasing with large values of $q$, while the fourth
term, (24), is increasing with $q$.

Assume a bound on the deviation, $\sum_t \|u_t - u_{t-1}\| = V^{(1)} \le O(T^{1/p})$, or in other words $p = (\log T)/(\log V^{(1)})$.
We set a drift-dependent parameter $q = (2p)/(p+1) = (2 \log T)/(\log T + \log V^{(1)})$ and get that the sum of (23) and
(24) is of order $O\bigl(T^{\frac{p+1}{2p}} \log T\bigr) = O\bigl(\sqrt{V^{(1)}\, T}\, \log T\bigr)$.

A few comments are in order. First, as long as $p > 1$ the sum
of (23) and (24) is $o(T)$, and thus the average regret is vanishing. Second, when
the noise is very low, that is $p \approx -(1 + \epsilon)$, the algorithm sets
$q \approx 2 + (2/\epsilon)$, and thus it will not make any restarts, and the
bound of $O(\log T)$ for the stationary case is retrieved. In other
words, for this choice of $q$ the algorithm will have only one
interval, and there will be no restarts.
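
The drift-dependent choice of $q$ can be written as a one-line helper. The sketch below uses a hypothetical function name and assumes $T > 1$, $V^{(1)} > 0$ and $T \cdot V^{(1)} \ne 1$ (otherwise the denominator vanishes).

```python
import math

def choose_q(T, V1):
    """q = 2p / (p + 1) with p = log T / log V1, written as 2 log T / (log T + log V1)."""
    return 2.0 * math.log(T) / (math.log(T) + math.log(V1))

if __name__ == "__main__":
    T = 10_000
    print(choose_q(T, 100.0))               # V1 = T^(1/2), i.e. p = 2
    print(choose_q(T, float(T)))            # linear drift, p = 1
    print(choose_q(T, T ** (-1.0 / 1.5)))   # very low drift, p = -(1 + 0.5)
```

In the last call $p = -(1 + \epsilon)$ with $\epsilon = 0.5$, so the formula $q \approx 2 + 2/\epsilon$ from the text gives a value of about 6, a slowly growing threshold sequence and hence essentially no restarts.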

To conclude, we showed that if the algorithm is given an 
upper bound on the amount of drift, which is sub-linear in T, 
it can achieve sub-linear regret. Furthermore, if it is known 
that there is no non-stationarity in the reference vectors, then 
running the algorithm with large enough q will have a regret 
logarithmic in T. 



B. Analysis of LASER algorithm 

We now analyze the performance of the LASER algorithm
in the worst-case setting in six steps. First, we state a technical
lemma (Lemma 7) that is used in the second step (Theorem 8), in
which we bound the regret with a quantity proportional to
$\sum_{t=1}^{T} x_t^\top D_t^{-1} x_t$. Third, in Lemma 9 we bound each of the
summands with two terms, one logarithmic and one linear in
the eigenvalues of the matrices $D_t$. In the fourth (Lemma 10)
and fifth (Lemma 11) steps we bound the eigenvalues of $D_t$,
first for scalars and then extending the results to matrices. Finally,
in Corollary 12 we put all these results together and get the
desired bounds.

Lemma 7: For all $t$ the following statement holds,

\[ D'_{t-1} D_t^{-1} x_t x_t^\top D_t^{-1} D'_{t-1} + D'_{t-1}\bigl(D_t^{-1} D'_{t-1} + c^{-1} I\bigr) - D_{t-1}^{-1} \preceq 0 , \]

where as defined in (22) we have $D'_{t-1} = \bigl(I + c^{-1} D_{t-1}\bigr)^{-1}$.

The proof appears in Sec. D. We next bound the cumulative
loss of the algorithm.

Theorem 8: Assume that the labels are bounded, $\sup_t |y_t| \le Y$ for some $Y \in \mathbb{R}$. Then the following bound holds,

\[ L_T(\mathrm{LASER}) \le \min_{u_1, \ldots, u_T} \Bigl[ L_T(\{u_t\}) + c\, V^{(2)}(\{u_t\}) + b \|u_1\|^2 \Bigr] + Y^2 \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t . \tag{25} \]



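Proof: Fix $t$. A long algebraic manipulation, given in Sec. E, yields,

\[ (y_t - \hat{y}_t)^2 + \min_{u_1, \ldots, u_{t-1}} Q_{t-1}(u_1, \ldots, u_{t-1}) - \min_{u_1, \ldots, u_t} Q_t(u_1, \ldots, u_t) \]
\[ = (y_t - \hat{y}_t)^2 + 2 y_t x_t^\top D_t^{-1} D'_{t-1} e_{t-1} + e_{t-1}^\top \Bigl[ -D_{t-1}^{-1} + D'_{t-1}\bigl(D_t^{-1} D'_{t-1} + c^{-1} I\bigr) \Bigr] e_{t-1} + y_t^2\, x_t^\top D_t^{-1} x_t - y_t^2 . \tag{26} \]

Substituting the specific value of the predictor $\hat{y}_t = x_t^\top D_t^{-1} D'_{t-1} e_{t-1}$ from (21), we get that (26) equals

\[ \hat{y}_t^2 + y_t^2\, x_t^\top D_t^{-1} x_t + e_{t-1}^\top \Bigl[ -D_{t-1}^{-1} + D'_{t-1}\bigl(D_t^{-1} D'_{t-1} + c^{-1} I\bigr) \Bigr] e_{t-1} \]
\[ = e_{t-1}^\top D'_{t-1} D_t^{-1} x_t x_t^\top D_t^{-1} D'_{t-1} e_{t-1} + y_t^2\, x_t^\top D_t^{-1} x_t + e_{t-1}^\top \Bigl[ -D_{t-1}^{-1} + D'_{t-1}\bigl(D_t^{-1} D'_{t-1} + c^{-1} I\bigr) \Bigr] e_{t-1} \]
\[ = e_{t-1}^\top \tilde{D}_t\, e_{t-1} + y_t^2\, x_t^\top D_t^{-1} x_t , \tag{27} \]

where $\tilde{D}_t = D'_{t-1} D_t^{-1} x_t x_t^\top D_t^{-1} D'_{t-1} - D_{t-1}^{-1} + D'_{t-1}\bigl(D_t^{-1} D'_{t-1} + c^{-1} I\bigr)$. Using Lemma 7 we upper bound $\tilde{D}_t \preceq 0$ and thus (27) is bounded,

\[ y_t^2\, x_t^\top D_t^{-1} x_t \le Y^2\, x_t^\top D_t^{-1} x_t . \]

Finally, summing over $t \in \{1, \ldots, T\}$ gives the desired bound,

\[ L_T(\mathrm{LASER}) - \min_{u_1, \ldots, u_T} \Bigl[ b \|u_1\|^2 + c\, V^{(2)}(\{u_t\}) + L_T(\{u_t\}) \Bigr] \le Y^2 \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t . \]
■

In the next lemma we further bound the right term of (25). This type of bound is based on the usage of the covariance-like matrix $D$.

Lemma 9:

\[ \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t \le \ln\Bigl|\frac{1}{b} D_T\Bigr| + c^{-1} \sum_{t=1}^{T} \mathrm{Tr}(D_{t-1}) . \tag{28} \]

Proof: Let $B_t = D_t - x_t x_t^\top = \bigl(D_{t-1}^{-1} + c^{-1} I\bigr)^{-1} \succ 0$. Then

\[ x_t^\top D_t^{-1} x_t = \mathrm{Tr}\bigl(x_t^\top D_t^{-1} x_t\bigr) = \mathrm{Tr}\bigl(D_t^{-1} x_t x_t^\top\bigr) = \mathrm{Tr}\bigl(D_t^{-1}(D_t - B_t)\bigr) \]
\[ = \mathrm{Tr}\bigl(D_t^{-1/2}(D_t - B_t) D_t^{-1/2}\bigr) = \mathrm{Tr}\bigl(I - D_t^{-1/2} B_t D_t^{-1/2}\bigr) = \sum_{j=1}^{d} \Bigl[ 1 - \lambda_j\bigl(D_t^{-1/2} B_t D_t^{-1/2}\bigr) \Bigr] . \]

We continue using $1 - x \le -\ln(x)$ and get

\[ x_t^\top D_t^{-1} x_t \le -\sum_{j=1}^{d} \ln\Bigl[\lambda_j\bigl(D_t^{-1/2} B_t D_t^{-1/2}\bigr)\Bigr] = -\ln\Bigl[\prod_{j=1}^{d} \lambda_j\bigl(D_t^{-1/2} B_t D_t^{-1/2}\bigr)\Bigr] = -\ln\bigl|D_t^{-1/2} B_t D_t^{-1/2}\bigr| = \ln\frac{|D_t|}{|D_t - x_t x_t^\top|} . \]

It follows that

\[ x_t^\top D_t^{-1} x_t \le \ln\frac{|D_t|}{\bigl|(D_{t-1}^{-1} + c^{-1} I)^{-1}\bigr|} = \ln\Bigl(\frac{|D_t|}{|D_{t-1}|}\,\bigl|I + c^{-1} D_{t-1}\bigr|\Bigr) = \ln\frac{|D_t|}{|D_{t-1}|} + \ln\bigl|I + c^{-1} D_{t-1}\bigr| , \]

and because $\ln\bigl|\frac{1}{b} D_0\bigr| \ge 0$ we get

\[ \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t \le \ln\Bigl|\frac{1}{b} D_T\Bigr| + \sum_{t=1}^{T} \ln\bigl|I + c^{-1} D_{t-1}\bigr| \le \ln\Bigl|\frac{1}{b} D_T\Bigr| + c^{-1} \sum_{t=1}^{T} \mathrm{Tr}(D_{t-1}) . \]
■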



At first sight it seems that the right term of (28) may grow
super-linearly with $T$, as each of the matrices $D_t$ grows with
$t$. The next two lemmas show that this is not the case, and
in fact, the right term of (28) is not growing too fast, which
will allow us to obtain a sub-linear regret bound. Lemma 10
analyzes the properties of the recursion of $D$ defined in (14)
for scalars, that is $d = 1$. In Lemma 11 we extend this analysis
to matrices.

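Lemma 10: Define $f(\lambda) = \lambda\beta/(\lambda + \beta) + x^2$ for $\beta, \lambda \ge 0$ and some $x^2 \le \gamma^2$. Then:

1) $f(\lambda) \le \beta + \gamma^2$

2) $f(\lambda) \le \lambda + \gamma^2$

3) $f(\lambda) \le \max\Bigl\{ \lambda,\ \frac{3\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2} \Bigr\}$

Proof: For the first property we have $f(\lambda) = \lambda\beta/(\lambda + \beta) + x^2 \le \beta \times 1 + x^2$. The second property follows from the symmetry between $\beta$ and $\lambda$. To prove the third property we decompose the function as $f(\lambda) = \lambda - \frac{\lambda^2}{\lambda + \beta} + x^2$. Therefore, the function is bounded by its argument, $f(\lambda) \le \lambda$, if, and only if, $-\frac{\lambda^2}{\lambda + \beta} + x^2 \le 0$. Since we assume $x^2 \le \gamma^2$, the last inequality holds if $-\lambda^2 + \gamma^2\lambda + \gamma^2\beta \le 0$, which holds for $\lambda \ge \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2}$.

To conclude: if $\lambda \ge \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2}$, then $f(\lambda) \le \lambda$. Otherwise, by the second property, we have

\[ f(\lambda) \le \lambda + \gamma^2 \le \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2} + \gamma^2 = \frac{3\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2} , \]

as required. ■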
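
The three properties of Lemma 10, and the fact that iterating the scalar recursion keeps the value bounded, can be checked numerically. The sketch below uses hypothetical helper names and arbitrary illustrative values for $\beta$ and $\gamma^2$.

```python
import math

def f(lam, beta, x_sq):
    """Scalar recursion of Lemma 10: f(lam) = lam * beta / (lam + beta) + x^2."""
    return lam * beta / (lam + beta) + x_sq

def cap(beta, gamma_sq):
    """Threshold of the third property: (3 g^2 + sqrt(g^4 + 4 g^2 beta)) / 2."""
    return (3.0 * gamma_sq + math.sqrt(gamma_sq ** 2 + 4.0 * gamma_sq * beta)) / 2.0

def iterate(lam0, beta, x_sqs):
    """Apply f repeatedly, one step per value of x^2; returns the whole trajectory."""
    lam = lam0
    path = [lam]
    for x_sq in x_sqs:
        lam = f(lam, beta, x_sq)
        path.append(lam)
    return path

if __name__ == "__main__":
    beta, gamma_sq = 50.0, 1.0
    path = iterate(0.1, beta, [gamma_sq] * 1000)
    print(max(path), cap(beta, gamma_sq))
```

By the third property every step satisfies $f(\lambda) \le \max\{\lambda, \text{cap}\}$, so by induction the trajectory never exceeds the larger of its starting value and the cap; this is the scalar version of the argument used for the eigenvalues of $D_t$.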
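We build on Lemma 10 to bound the maximal eigenvalue of the matrices $D_t$.

Lemma 11: Assume $\|x_t\|^2 \le X^2$ for some $X$. Then, the eigenvalues of $D_t$ (for $t \ge 1$), denoted by $\lambda_i(D_t)$, are upper bounded by

\[ \max_i \lambda_i(D_t) \le \max\Bigl\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\ b + X^2 \Bigr\} . \]

Proof: By induction. From (14) we have that $\lambda_i(D_1) \le b + X^2$ for $i = 1, \ldots, d$. We proceed with a proof for some $t$. For simplicity, denote by $\lambda_i = \lambda_i(D_{t-1})$ the $i$th eigenvalue of $D_{t-1}$ with a corresponding eigenvector $v_i$. From (14) we have,

\[ D_t = \bigl(D_{t-1}^{-1} + c^{-1} I\bigr)^{-1} + x_t x_t^\top \preceq \bigl(D_{t-1}^{-1} + c^{-1} I\bigr)^{-1} + I\,\|x_t\|^2 = \sum_{i=1}^{d} v_i v_i^\top \Bigl( \frac{\lambda_i c}{\lambda_i + c} + \|x_t\|^2 \Bigr) . \tag{29} \]

Plugging Lemma 10 in (29) we get,

\[ D_t \preceq \sum_{i=1}^{d} v_i v_i^\top \max\Bigl\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\ b + X^2 \Bigr\} = \max\Bigl\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\ b + X^2 \Bigr\}\, I . \]
■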

Finally, equipped with the above lemmas we are able to 
prove the main result of this section. 

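Corollary 12: Assume $\|x_t\|^2 \le X^2$, $|y_t| \le Y$. Then,

\[ L_T(\mathrm{LASER}) \le b \|u_1\|^2 + L_T(\{u_t\}) + Y^2 \ln\Bigl|\frac{1}{b} D_T\Bigr| + c^{-1} Y^2\, \mathrm{Tr}(D_0) + c\, V^{(2)} + c^{-1} Y^2\, T d\, \max\Bigl\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\ b + X^2 \Bigr\} . \tag{30} \]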



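Furthermore, set $b = \epsilon c$ for some $0 < \epsilon < 1$. Denote by $\mu = \max\bigl\{ \tfrac{9}{8} X^2,\ \tfrac{(b + X^2)^2}{8 X^2} \bigr\}$ and $M = \max\{3X^2,\ b + X^2\}$. If

\[ V^{(2)} \le T\, \frac{\sqrt{2}\, Y^2 d X}{\mu^{3/2}} \tag{31} \]

(low drift), then by setting $c = \bigl(\sqrt{2}\, T Y^2 d X / V^{(2)}\bigr)^{2/3}$ we have,

\[ L_T(\mathrm{LASER}) \le b \|u_1\|^2 + 3\bigl(\sqrt{2}\, Y^2 d X\bigr)^{2/3} T^{2/3} \bigl(V^{(2)}\bigr)^{1/3} + \frac{\epsilon}{1 - \epsilon}\, Y^2 d + L_T(\{u_t\}) + Y^2 \ln\Bigl|\frac{1}{b} D_T\Bigr| . \tag{32} \]

The proof appears in the Appendix. Note that if

\[ V^{(2)} \ge T\, \frac{Y^2 d M}{\mu^2} \tag{33} \]

then by setting $c = \sqrt{Y^2 d M T / V^{(2)}}$ we have,

\[ L_T(\mathrm{LASER}) \le b \|u_1\|^2 + 2\sqrt{Y^2 d\, T M\, V^{(2)}} + \frac{\epsilon}{1 - \epsilon}\, Y^2 d + L_T(\{u_t\}) + Y^2 \ln\Bigl|\frac{1}{b} D_T\Bigr| \]

(see Sec. G for details). The last bound is linear in $T$ and can
be obtained also by a naive algorithm that outputs $\hat{y}_t = 0$ for
all $t$.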
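
The low-drift choice of $c$ balances a term that grows with $c$ against one that shrinks with it. The sketch below illustrates this under an explicit assumption that is not stated in this form in the text: the $c$-dependent part of the bound is modeled as $c\,V^{(2)} + 2\sqrt{2}\,Y^2 T d X / \sqrt{c}$, which corresponds to upper-estimating the eigenvalue cap by $2\sqrt{2}\,X\sqrt{c}$. The function names are hypothetical.

```python
import math

def tuned_c(T, Y, d, X, V2):
    """Minimizer of the modeled trade-off: (sqrt(2) * T * Y^2 * d * X / V2)^(2/3)."""
    return (math.sqrt(2.0) * T * Y ** 2 * d * X / V2) ** (2.0 / 3.0)

def tradeoff(c, T, Y, d, X, V2):
    """Modeled c-dependent part of the bound (an assumption, see the lead-in)."""
    return c * V2 + 2.0 * math.sqrt(2.0) * Y ** 2 * T * d * X / math.sqrt(c)

def tuned_value(T, Y, d, X, V2):
    """Closed form at the minimizer: 3 (sqrt(2) Y^2 d X)^(2/3) T^(2/3) V2^(1/3)."""
    k = math.sqrt(2.0) * Y ** 2 * d * X
    return 3.0 * k ** (2.0 / 3.0) * T ** (2.0 / 3.0) * V2 ** (1.0 / 3.0)

if __name__ == "__main__":
    args = (1000, 1.0, 5, 1.0, 10.0)   # T, Y, d, X, V2 (illustrative values)
    c_star = tuned_c(*args)
    print(c_star, tradeoff(c_star, *args), tuned_value(*args))
```

Setting the derivative of $c\,V + a\,c^{-1/2}$ to zero gives $c = (a / 2V)^{2/3}$, which with $a = 2\sqrt{2}\,Y^2 T d X$ is the expression returned by `tuned_c`.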

A few remarks are in order. When the variance $V^{(2)}$
goes to zero, we set $c = \infty$ and thus we have $D_t = bI + \sum_{s=1}^{t} x_s x_s^\top$,
as used in recent algorithms ([8], [16], [22], [36]). In
this case the algorithm reduces to the algorithm by Forster [16]
(which is also the AAR algorithm of Vovk [36]), with the same
logarithmic regret bound (note that the last term in the bounds
is logarithmic in $T$, see the proof of Forster [16]). See also
the work of Azoury and Warmuth [2].

VII. Simulations 

We evaluate our algorithms on three datasets, one synthetic 
and two real-world. The synthetic dataset contains 2,000
points in $\mathbb{R}^{20}$, where the first ten coordinates were grouped
into five groups of size two. Each such pair was drawn from
a 45° rotated Gaussian distribution with standard deviations
10 and 1. The remaining 10 coordinates were drawn from
independent Gaussian distributions $\mathcal{N}(0, 2)$. The dataset was
generated using a sequence of vectors $u_t \in \mathbb{R}^{20}$ for which
the only non-zero coordinates are the first two, where their
values are the coordinates of a unit vector that is rotating
with a constant rate. Specifically, we have $\|u_t\| = 1$ and the
instantaneous drift $\|u_t - u_{t-1}\|$ is constant.
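
A sequence of target vectors with these properties can be sketched as follows. The rotation step `angle_step` and the function names are hypothetical, since the text does not give the rotation rate; only the structure (unit norm, first two coordinates rotating, constant drift) follows the description above.

```python
import math

def rotating_targets(T, angle_step, dim=20):
    """Unit vectors whose first two coordinates rotate by angle_step per round."""
    targets = []
    for t in range(T):
        u = [0.0] * dim
        u[0] = math.cos(t * angle_step)
        u[1] = math.sin(t * angle_step)
        targets.append(u)
    return targets

def distance(u, v):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

if __name__ == "__main__":
    us = rotating_targets(2000, 0.01)
    drifts = [distance(us[t], us[t - 1]) for t in range(1, len(us))]
    # Every instantaneous drift equals the chord length 2 * sin(angle_step / 2).
    print(min(drifts), max(drifts), 2.0 * math.sin(0.005))
```

With a constant drift $\nu = 2\sin(\text{angle\_step}/2)$ the total drift is $V^{(1)} = (T-1)\,\nu$, which is the linear-drift regime discussed for the regret bounds.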

The other two datasets are generated from an echoed speech
signal. The first echoed speech signal was generated using
an FIR filter with k delays and varying attenuated amplitude.
This effect imitates acoustic echo reflections from
large, distant and dynamic obstacles. The difference equation
$y(n) = x(n) + A(n)\,x(n - D) + v(n)$ was used,
where $D$ is a delay in samples, the coefficient $A(n)$ describes
the changing attenuation related to object reflection and
v{n) ^ N (0, 10~^) is a white noise. The second speech 
echoed signal was generated using a flange IIR filter, where
the delay is not constant, but changing with time. This effect
imitates time stretching of the audio signal caused by moving
and changing objects in the room. The difference equation
$y(n) = x(n) + A\,y(n - D(n)) + v(n)$ was used.
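
The two difference equations can be sketched directly. In the code below the function names, the zero initial conditions (samples before the start of the signal are treated as zero), the noise parameterization and the seed are all assumptions made for illustration; the text does not specify them.

```python
import random

def fir_echo(x, delay, attenuation, noise_std=0.0, seed=0):
    """y(n) = x(n) + A(n) * x(n - D) + v(n); attenuation is a function n -> A(n)."""
    rng = random.Random(seed)
    y = []
    for n, xn in enumerate(x):
        delayed = x[n - delay] if n - delay >= 0 else 0.0
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        y.append(xn + attenuation(n) * delayed + noise)
    return y

def flange_echo(x, delay_fn, a, noise_std=0.0, seed=0):
    """y(n) = x(n) + A * y(n - D(n)) + v(n); delay_fn is a function n -> D(n) >= 1."""
    rng = random.Random(seed)
    y = []
    for n, xn in enumerate(x):
        k = n - delay_fn(n)
        fed_back = y[k] if k >= 0 else 0.0
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        y.append(xn + a * fed_back + noise)
    return y

if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
    print(fir_echo(impulse, 2, lambda n: 0.5))
    print(flange_echo(impulse, lambda n: 2, 0.5))
```

The first filter only echoes the input, so an impulse produces a single delayed copy, while the second feeds back its own output, so the echo repeats with geometrically decaying amplitude.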

Five algorithms are evaluated: NLMS (normalized least 
mean square) (Q, Q) which is a state-of-the-art first-order 
algorithm, AROWR (AROW for Regression) with no restarts 
nor projection, ARCOR, LASER and CR-RLS.

Fig. 1. Cumulative squared loss for AROWR, ARCOR, LASER, NLMS and CR-RLS vs. iteration. The left panel shows results for the synthetic dataset with drift, and the two right panels show results for a problem of acoustic echo cancellation on speech signals (best shown in color).

For the synthetic datasets the algorithms' parameters were tuned using
a single random sequence. We repeat each experiment 100
times, reporting the mean cumulative square loss. We note that
AAR (|||T), (36)) is a special case of LASER and RLS is a 
special case of CR-RLS, for a specific choice of their respec- 
tive parameters. Additionally, the performance of AROWR, 
AAR and RLS is similar, and thus only the performance 
of AROWR is shown. For the speech signal the algorithms' 
parameters were tuned on 10% of the signal, then the best 
parameter choices for each algorithm were used to evaluate 
the performance on the remaining signal. 

The results are summarized in Fig. 1. AROWR performs
worst on all datasets, as it converges very fast and thus is not able
to track the changes in the data. Focusing on the left panel,
showing the results for the synthetic signal, we observe that
ARCOR performs relatively badly, as suggested by our analysis
for constant, yet not too large, drift. Both CR-RLS and NLMS
perform better, where CR-RLS is slightly better as it is a
second-order algorithm, which allows it to converge faster between
switches. On the other hand, NLMS is not converging and is
able to adapt to the drift. Finally, LASER performs the best,
as hinted by its analysis, for which the bound is lower when
there is a constant drift.

Moving to the center panel, showing the results for the first
echoed speech signal with varying amplitude, we observe that
LASER is the worst among all algorithms except AROWR.
Indeed, it does prevent convergence by keeping the learning
rates far from zero, yet it is a min-max algorithm designed
for the worst case, which is not the case for real-world
speech data. However, speech data is highly regular and the
instantaneous drift varies. NLMS performs better as it is not
converging, yet both CR-RLS and ARCOR perform even
better, as they are both non-converging due to covariance resets
on the one hand, and use second-order updates on the other hand.
ARCOR outperforms CR-RLS as the former adapts the resets
to the actual data, and does not use pre-defined scheduling as the
latter does.

Finally, the right panel summarizes the results for evaluations
on the second echoed speech signal. Note that the amount
of drift grows since the data is generated using a flange filter.
Both LASER and ARCOR are outperformed, as both assume
drift that is sublinear or at most linear, which is not the case.
CR-RLS outperforms NLMS. The latter is first order, so it is able
to adapt to changes, yet has a slower convergence rate. The former
is able to cope with drift due to resets.

Interestingly, in all experiments, NLMS performed neither
the best nor the worst. There is no clear winner among the
three algorithms that are both second order and designed to
adapt to drifts. Intuitively, if the drift suits the assumptions
of an algorithm, that algorithm would perform the best, and
otherwise, its performance may even be worse than that of NLMS.

We have seen above that ARCOR performs a projection
step, which was partially motivated by the analysis. We
now evaluate its need and effect in practice on two speech
problems. We test two modifications of ARCOR, resulting in
four variants altogether. First, we replace the polynomial
thresholds scheme with a constant thresholds scheme, that is,
all thresholds are equal. Second, we omit the projection step.
The results are summarized in Fig. 2. The line corresponding
to the original algorithm is called "proj, poly" as it performs
a projection step and uses a polynomial schema for the lower
bound on eigenvalues. The version that omits projection and
uses a constant schema, called "no proj, const", is most similar
to CR-RLS. Both reset the covariance matrix: CR-RLS after
a fixed number of iterations, and "ARCOR-no proj, const"
when the eigenvalues meet a specified fixed lower bound.
The difference between the two plots is the amount of drift
used: the top panel shows results for sublinear drift, and the
bottom panel shows results with increasing per-instance drift.
The original version, as hinted by the analysis, is designed
to work with sublinear drift, and performs the best in this
case. However, when this assumption on the amount of
drift breaks, this version is no longer optimal, and the constant
schema performs better, as it allows the algorithm to adapt
to non-vanishing drift. Finally, in both datasets, the algorithm
that performs best performs a projection step after each iteration,
providing some empirical evidence for its need.

VIII. Summary and Conclusions 

We proposed and analyzed two novel algorithms for non-stationary online regression designed and analyzed with the squared loss in the worst-case regret framework. The ARCOR algorithm was built on AROWR, which employs second-order information, yet performs data-dependent covariance resets, which provide it the ability to track drifts. The LASER algorithm was built on the last-step min-max predictor with the proper modifications for non-stationary problems. Regret bounds shown and proven are optimized using knowledge of the amount of drift, and in general the two algorithms are not comparable.

A few open directions are possible. First, to extend these algorithms to loss functions other than the squared loss. Second, currently, direct implementation of both algorithms requires either matrix inversion or eigenvector decomposition. A possible direction is to design a more efficient version of these algorithms. Third, an interesting direction is to design algorithms that automatically detect the level of drift, or do not need this information before run-time.

Fig. 2. Cumulative squared loss of four variants of ARCOR vs. iteration.

Appendix

A. Proof of Lemma 1

Proof: We calculate

\[ P_t(u_t) = \min_{u_{t-1}} \Bigl( P_{t-1}(u_{t-1}) + c\,\|u_t - u_{t-1}\|^2 + (y_t - u_t^\top x_t)^2 \Bigr) . \]

B. Proof of Lemma 2

Proof: By definition

\[ P_1(u_1) = Q_1(u_1) = b \|u_1\|^2 + (y_1 - u_1^\top x_1)^2 = u_1^\top \bigl(bI + x_1 x_1^\top\bigr) u_1 - 2 y_1 u_1^\top x_1 + y_1^2 , \]

and indeed $D_1 = bI + x_1 x_1^\top$, $e_1 = y_1 x_1$, and $f_1 = y_1^2$.
We proceed by induction; assume that $P_{t-1}(u_{t-1}) = u_{t-1}^\top D_{t-1} u_{t-1} - 2 u_{t-1}^\top e_{t-1} + f_{t-1}$. Applying Lemma 1 we get,

\[ P_t(u_t) = \min_{u_{t-1}} \Bigl( u_{t-1}^\top D_{t-1} u_{t-1} - 2 u_{t-1}^\top e_{t-1} + f_{t-1} + c\,\|u_t - u_{t-1}\|^2 + (y_t - u_t^\top x_t)^2 \Bigr) \]
\[ = \min_{u_{t-1}} \Bigl( u_{t-1}^\top (cI + D_{t-1})\, u_{t-1} - 2 u_{t-1}^\top (c\,u_t + e_{t-1}) + f_{t-1} + c\,\|u_t\|^2 + (y_t - u_t^\top x_t)^2 \Bigr) \]


Proof: We start by writing the distances explicitly 

dt-i {wt-i,ut-i) - df {wt, ut-i) 

= ~ (Mt_i - Wt)^tt^ {Ut^i - Wt) 

Substituting Wt as appears in ([T0| the last equation becomes. 



[Ut- 



(Ut- 



jyt-xjwt-i) 

r + xjT,t~iXt 



{cut + et^i)^ {cl + Dt^i) ^{cut + Bt-i) 



It-i 



— ■ — ^- Xt St-iSt ^t-ixt 

r + xl St-iajf / 



iyt-ujxtf 



+ C\\Ut\\ 

uj (^cl + xtxj - {cl + Dt-iY^ 
-2uJ c{cl + Dt-iY^ et-i+ytXt 



Ut 



- e7_i (c/ + A-i) et-i + ft-i + yt ■ 

Using the Woodbury identity we continue to develop the last 
equation, 

-1" 



+ {ut-i - Wt-i)^T,^\ {ut-i - wt^i) 
Plugging as appears in (|9]l we get, 

dt-i {wt-i,ut-i) - diiwt,ut-i) 

= - {Ut-l ~ Wt-l) 



XtX^ {Ut-1 - Wt^i) 



+ 2(Mt_i-'Wt-i)^( T,^\ + -XtxJ ) T^t-iXt 



cl + xtxj-c^ c^^ I - c-"^ {D^\ + c-^ I) 
(/ + c"^A-i) ^et-i+ytXt 



{yt 



xjwt^i) 



Ut 



2u 



eJ_^{cI + Dt-i) ^et-i + ft-i 



(r + xjT,t-iXt) 
+ {ut-i - Wt-i)^T,^_}^ {ut-i - wt-i) 



{yt-xjwt-i) 
r+xjEt-iXt 



xj'^t-i i ^t-i H XtxJ j Tit-iXt 



V't 



Finally, we substitute £t 



2u 



„-i 



XtxJ 1 Ut 



{yt-xJwt-iY 



Dt-i 



et-i + ytXt 



9t = 

{yt-x^Ut^iY and, xt = xlY^t^iXt. Rearranging the 
terms. 



e7_i(c/ + A-i) ^ et-i + ft-i + yt 



and indeed D* 



d: 



{l + c ^A-i) ef_i + ytXt and, ft 
ej_i {cl + Dt-i) ^ Bt-i + yt, as desired. 



XtxJ , et 
— ft-i 



dt-i {wt~i,ut^i) - dt {wt, Mt_i) 
= - ^ (yt - xjwt-i - {yt - xjut^i))^ 
2{yt-xjut-i-{yt-xjwt^i)) {yt- 



-xjwt-i) 



r + Xt 



C. Proof of Theorem 3

We prove the theorem in four steps. First, we state the
following technical lemma, for which we define the following
notation,

\[ d_t(z, v) = (z - v)^\top \Sigma_t^{-1} (z - v) , \]
\[ d_{\tilde{t}}(z, v) = (z - v)^\top \tilde{\Sigma}_t^{-1} (z - v) , \]
\[ \chi_t = x_t^\top \Sigma_{t-1} x_t . \]

itXt 
{r + Xtf 

-it +2{yt- xjwt-i) {yt - xjut-i) gt 



Xt 
r 



24 



- 2 



r 



r + Xt 

'yt - 



etxt 



r{r + Xt) 
xjwt-i) {yt - xjut-i) 

r + Xt 
1 itXt 



Xt 
r 



-9t 



r{r + xt) 



which completes the proof. 



Second, we define a telescopic sum and in Lemma 14 prove a lower bound for each element. Third, in Lemma 15 we upper bound one term of the telescopic sum, and finally, in the fourth step we combine all these parts to conclude the proof. Let us start with the technical lemma.

Lemma 13: Let $\tilde{w}_t$ and $\tilde{\Sigma}_t$ be defined in (9) and (10), then

\[ d_{t-1}(w_{t-1}, u_{t-1}) - d_{\tilde{t}}(\tilde{w}_t, u_{t-1}) = \frac{1}{r}\,\ell_t - \frac{1}{r}\,g_t - \frac{\ell_t \chi_t}{r\,(r + \chi_t)} , \]

where $\ell_t = (y_t - w_{t-1}^\top x_t)^2$ and $g_t = (y_t - u_{t-1}^\top x_t)^2$.

We now define one element of the telescopic sum, and lower bound it.

Lemma 14: Denote by

\[ \Delta_t = d_{t-1}(w_{t-1}, u_{t-1}) - d_t(w_t, u_t) , \]

then

\[ \Delta_t \ge \frac{1}{r}\,(\ell_t - g_t) - \frac{\ell_t \chi_t}{r\,(r + \chi_t)} + u_{t-1}^\top \Sigma_{t-1}^{-1} u_{t-1} - u_t^\top \Sigma_t^{-1} u_t - 2 R_B\, \Lambda_i^{-1}\, \|u_{t-1} - u_t\| , \]

where $i - 1$ is the number of restarts occurring before example $t$.

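Proof: We write $\Delta_t$ as a telescopic sum of four terms as follows,

\[ \Delta_{t,1} = d_{t-1}(w_{t-1}, u_{t-1}) - d_{\tilde{t}}(\tilde{w}_t, u_{t-1}) \]
\[ \Delta_{t,2} = d_{\tilde{t}}(\tilde{w}_t, u_{t-1}) - d_t(\tilde{w}_t, u_{t-1}) \]
\[ \Delta_{t,3} = d_t(\tilde{w}_t, u_{t-1}) - d_t(w_t, u_{t-1}) \]
\[ \Delta_{t,4} = d_t(w_t, u_{t-1}) - d_t(w_t, u_t) \]

We lower bound each of the four terms. Since the value of $\Delta_{t,1}$ was computed in Lemma 13 we start with the second term. If no reset occurs then $\Sigma_t = \tilde{\Sigma}_t$ and $\Delta_{t,2} = 0$. Otherwise, we use the facts that $0 \preceq \tilde{\Sigma}_t \preceq I$ and $\Sigma_t = I$, and get,

\[ \Delta_{t,2} = (\tilde{w}_t - u_{t-1})^\top \tilde{\Sigma}_t^{-1} (\tilde{w}_t - u_{t-1}) - (\tilde{w}_t - u_{t-1})^\top \Sigma_t^{-1} (\tilde{w}_t - u_{t-1}) \ge \mathrm{Tr}\Bigl( (\tilde{w}_t - u_{t-1})(\tilde{w}_t - u_{t-1})^\top (I - I) \Bigr) = 0 . \]

To summarize, $\Delta_{t,2} \ge 0$. We can lower bound $\Delta_{t,3} \ge 0$ by using the fact that $w_t$ is a projection of $\tilde{w}_t$ onto a closed set (a ball of radius $R_B$ around the origin), which by our assumption contains $u_t$. Employing Corollary 3 of Herbster and Warmuth [23] we get $d_t(\tilde{w}_t, u_{t-1}) \ge d_t(w_t, u_{t-1})$ and thus $\Delta_{t,3} \ge 0$.

Finally, we lower bound the fourth term $\Delta_{t,4}$,

\[ \Delta_{t,4} = (w_t - u_{t-1})^\top \Sigma_t^{-1} (w_t - u_{t-1}) - (w_t - u_t)^\top \Sigma_t^{-1} (w_t - u_t) = u_{t-1}^\top \Sigma_t^{-1} u_{t-1} - u_t^\top \Sigma_t^{-1} u_t - 2 w_t^\top \Sigma_t^{-1} (u_{t-1} - u_t) . \tag{34} \]

We use the Hölder inequality and then the Cauchy-Schwarz inequality to get the following lower bound,

\[ -2 w_t^\top \Sigma_t^{-1} (u_{t-1} - u_t) = -2\,\mathrm{Tr}\bigl( \Sigma_t^{-1} (u_{t-1} - u_t)\, w_t^\top \bigr) \ge -2 \lambda_{\max}\bigl(\Sigma_t^{-1}\bigr)\, w_t^\top (u_{t-1} - u_t) \ge -2 \lambda_{\max}\bigl(\Sigma_t^{-1}\bigr)\, \|w_t\|\, \|u_{t-1} - u_t\| . \]

Using the facts that $\|w_t\| \le R_B$ and that $\lambda_{\max}(\Sigma_t^{-1}) = 1/\lambda_{\min}(\Sigma_t) \le \Lambda_i^{-1}$, where $i$ is the current segment index, we get,

\[ -2 w_t^\top \Sigma_t^{-1} (u_{t-1} - u_t) \ge -2 \Lambda_i^{-1} R_B\, \|u_{t-1} - u_t\| . \tag{35} \]

Substituting (35) in (34) and using $\Sigma_t \preceq \Sigma_{t-1}$, a lower bound is obtained,

\[ \Delta_{t,4} \ge u_{t-1}^\top \Sigma_t^{-1} u_{t-1} - u_t^\top \Sigma_t^{-1} u_t - 2 R_B \Lambda_i^{-1} \|u_{t-1} - u_t\| \ge u_{t-1}^\top \Sigma_{t-1}^{-1} u_{t-1} - u_t^\top \Sigma_t^{-1} u_t - 2 R_B \Lambda_i^{-1} \|u_{t-1} - u_t\| . \tag{36} \]

Combining (36) with Lemma 13 concludes the proof. ■
Next we state an upper bound that will appear in one of the
summands of the telescopic sum.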
Next we state an upper bound that will appear in one of the 
summands of the telescopic sum. 

Lemma 15: During the runtime of the ARCOR algorithm we have,

\[ \sum_{t=t_i}^{t_i + T_i} \frac{\chi_t}{\chi_t + r} \le \log\det\bigl(\Sigma_{t_{i+1}-1}^{-1}\bigr) = \log\det\bigl((\Sigma^{i})^{-1}\bigr) . \]

We remind the reader that $t_i$ is the first example index after the $i$th restart, and $T_i$ is the number of examples observed before the next restart. We also remind the reader of the notation $\Sigma^{i} = \Sigma_{t_{i+1}-1}$, the covariance matrix just before the next restart.

The proof of the lemma is similar to the proof of Lemma 4 by Crammer et al. [12] and is thus omitted. We now put all the pieces together and prove Theorem 3.

Proof: We bound the sum J^t from above and below, 
and start with an upper bound using the property of telescopic 
sum, 

X] = X! i'^t-i,ut^i) - dt {wt, Ut)) 
t t 

= do {wo, uo) - dr {wt, ut) < do {wo, uq) . (37) 
We compute a lower bound by applying Lemma [14] 

Xt 



t t \ 



9t) - 



>(r + xt) 

+ MtLiSjl\Mt-i - ujT^t^Ut - 2RBAT'^l^\\ut-i - ut\\^ 

where i{t) is the number of restarts occurred before observing 
the tth example. Continuing to develop the last equation we 
obtain. 



Xt 



r{r + xt) 



-^2i?BA^-i)||wt-i-Mt|| 
t 

4E^-!:E5-E^* 



- uJTjq ^Uo ~ vJpYirj} Ut 
2RB^tCi^t)\\'^t-^ - "tl 



(38) 



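Combining (37) with (38) and using $d_0(w_0, u_0) = u_0^\top \Sigma_0^{-1} u_0$ (as $w_0 = 0$),

$$
\frac{1}{r}\sum_t \ell_t - \frac{1}{r}\sum_t g_t - \sum_t \frac{\ell_t \chi_t}{r(r + \chi_t)}
- u_T^\top \Sigma_T^{-1} u_T - 2 R_B \sum_t \Lambda_{i(t)}^{-1} \|u_{t-1} - u_t\| \leq 0 .
$$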



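Rearranging the terms of the last inequality,

$$
\sum_t \ell_t \leq \sum_t g_t + \sum_t \frac{\ell_t \chi_t}{r + \chi_t} + r\, u_T^\top \Sigma_T^{-1} u_T
+ 2 R_B r \sum_t \frac{1}{\Lambda_{i(t)}} \|u_{t-1} - u_t\| .
$$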



Since $\|w_t\| \leq R_B$ and we assume that $\|x_t\| = 1$ and $\sup_t |y_t| = Y$, we get that $\sup_t \ell_t \leq 2(R_B^2 + Y^2)$. Substituting the last inequality in Lemma 15 we bound the second term in the right-hand side,



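$$
\begin{aligned}
\sum_t \frac{\ell_t \chi_t}{r + \chi_t} &= \sum_i^n \sum_{t=t_i}^{t_i+T_i} \frac{\ell_t \chi_t}{r + \chi_t} \\
&\leq \sum_i^n \left( \sup_t \ell_t \right) \log\det\left((\Sigma^i)^{-1}\right) \\
&\leq 2\left(R_B^2 + Y^2\right) \sum_i^n \log\det\left((\Sigma^i)^{-1}\right) ,
\end{aligned}
$$

which completes the proof. ■

D. Proof of Lemma 7

Proof: We first use the Woodbury equation to get the following two identities

$$
D_t^{-1} = \left[ \left(D_{t-1}^{-1} + c^{-1}I\right)^{-1} + x_t x_t^\top \right]^{-1}
= D_{t-1}^{-1} + c^{-1}I - \frac{\left(D_{t-1}^{-1} + c^{-1}I\right) x_t x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right)}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} ,
$$

and

$$
\left(I + c^{-1}D_{t-1}\right)^{-1} = I - c^{-1}\left(D_{t-1}^{-1} + c^{-1}I\right)^{-1} .
$$

Multiplying both identities with each other we get,

$$
\begin{aligned}
D_t^{-1}\left(I + c^{-1}D_{t-1}\right)^{-1}
&= \left[ D_{t-1}^{-1} + c^{-1}I - \frac{\left(D_{t-1}^{-1} + c^{-1}I\right) x_t x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right)}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \right]
\left[ I - c^{-1}\left(D_{t-1}^{-1} + c^{-1}I\right)^{-1} \right] \\
&= D_{t-1}^{-1} - \frac{\left(D_{t-1}^{-1} + c^{-1}I\right) x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} ,
\end{aligned}
\tag{39}
$$

and, similarly, we multiply the identities in the other order and get,

$$
\begin{aligned}
\left(I + c^{-1}D_{t-1}\right)^{-1} D_t^{-1}
&= \left[ I - c^{-1}\left(D_{t-1}^{-1} + c^{-1}I\right)^{-1} \right]
\left[ D_{t-1}^{-1} + c^{-1}I - \frac{\left(D_{t-1}^{-1} + c^{-1}I\right) x_t x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right)}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \right] \\
&= D_{t-1}^{-1} - \frac{D_{t-1}^{-1} x_t x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right)}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} .
\end{aligned}
\tag{40}
$$

Finally, from (39) we get,

$$
\begin{aligned}
&\left(I + c^{-1}D_{t-1}\right)^{-1} D_t^{-1} x_t x_t^\top D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1} - D_{t-1}^{-1}
+ \left(I + c^{-1}D_{t-1}\right)^{-1} \left[ D_t^{-1}\left(I + c^{-1}D_{t-1}\right)^{-1} + c^{-1}I \right] \\
&\quad = \left(I + c^{-1}D_{t-1}\right)^{-1} D_t^{-1} x_t x_t^\top D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1} - D_{t-1}^{-1} \\
&\qquad + \left[ I - c^{-1}\left(D_{t-1}^{-1} + c^{-1}I\right)^{-1} \right]
\left[ D_{t-1}^{-1} + c^{-1}I - \frac{\left(D_{t-1}^{-1} + c^{-1}I\right) x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \right] .
\end{aligned}
$$

We further develop the last equality and use (39) and (40) in the second equality below,

$$
\begin{aligned}
&= \left(I + c^{-1}D_{t-1}\right)^{-1} D_t^{-1} x_t x_t^\top D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1}
- \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \\
&= \left( D_{t-1}^{-1} - \frac{D_{t-1}^{-1} x_t x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right)}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \right)
x_t x_t^\top
\left( D_{t-1}^{-1} - \frac{\left(D_{t-1}^{-1} + c^{-1}I\right) x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \right)
- \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \\
&= \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{\left[1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t\right]^2}
- \frac{D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t} \\
&= - \frac{x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t \; D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}}{\left[1 + x_t^\top \left(D_{t-1}^{-1} + c^{-1}I\right) x_t\right]^2} \preceq 0 . \quad ■
\end{aligned}
$$

E. Derivations for Theorem 8

$$
\begin{aligned}
&(y_t - \hat{y}_t)^2 + \min_{u_1,\ldots,u_{t-1}} Q_{t-1}(u_1,\ldots,u_{t-1}) - \min_{u_1,\ldots,u_t} Q_t(u_1,\ldots,u_t) \\
&\quad = (y_t - \hat{y}_t)^2 - e_{t-1}^\top D_{t-1}^{-1} e_{t-1} + f_{t-1} + e_t^\top D_t^{-1} e_t - f_t \\
&\quad = (y_t - \hat{y}_t)^2 - e_{t-1}^\top D_{t-1}^{-1} e_{t-1} \\
&\qquad + \left( \left(I + c^{-1}D_{t-1}\right)^{-1} e_{t-1} + y_t x_t \right)^\top D_t^{-1} \left( \left(I + c^{-1}D_{t-1}\right)^{-1} e_{t-1} + y_t x_t \right) \\
&\qquad + e_{t-1}^\top \left(cI + D_{t-1}\right)^{-1} e_{t-1} - y_t^2 ,
\end{aligned}
$$

where the last equality follows from the recursions for $e_t$ and $f_t$. We proceed to develop the last equality,

$$
\begin{aligned}
&= (y_t - \hat{y}_t)^2 - e_{t-1}^\top D_{t-1}^{-1} e_{t-1} \\
&\quad + e_{t-1}^\top \left(I + c^{-1}D_{t-1}\right)^{-1} D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1} e_{t-1} \\
&\quad + 2 y_t x_t^\top D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1} e_{t-1} \\
&\quad + y_t^2 x_t^\top D_t^{-1} x_t + e_{t-1}^\top \left(cI + D_{t-1}\right)^{-1} e_{t-1} - y_t^2 \\
&= (y_t - \hat{y}_t)^2 + e_{t-1}^\top \left( - D_{t-1}^{-1} + \left(I + c^{-1}D_{t-1}\right)^{-1} D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1} + \left(cI + D_{t-1}\right)^{-1} \right) e_{t-1} \\
&\quad + 2 y_t x_t^\top D_t^{-1} \left(I + c^{-1}D_{t-1}\right)^{-1} e_{t-1} + y_t^2 x_t^\top D_t^{-1} x_t - y_t^2 .
\end{aligned}
$$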
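The matrix identities (39) and (40), and the final expression in the proof of Lemma 7, can be verified with exact rational arithmetic on a small example. The sketch below works with 2x2 matrices and assumes the update $D_t = (D_{t-1}^{-1} + c^{-1}I)^{-1} + x_t x_t^\top$ implied by the first Woodbury identity; the helper names and the example values are assumptions for illustration only.

```python
from fractions import Fraction as F

I2 = [[F(1), F(0)], [F(0), F(1)]]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def scale(a, A):
    return [[a * A[i][j] for j in range(2)] for i in range(2)]

def inv(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def matvec(A, x):
    return [A[i][0] * x[0] + A[i][1] * x[1] for i in range(2)]

def outer(x, y):
    return [[x[i] * y[j] for j in range(2)] for i in range(2)]

def check_lemma7_identities(D_prev, x, c):
    """Check (39), (40) and the closed form of the Lemma 7 expression exactly."""
    D_prev = [[F(v) for v in row] for row in D_prev]   # symmetric positive definite
    x, c = [F(v) for v in x], F(c)
    A = inv(D_prev)                                    # D_{t-1}^{-1}
    B = add(A, scale(1 / c, I2))                       # D_{t-1}^{-1} + c^{-1} I
    D_t_inv = inv(add(inv(B), outer(x, x)))            # assumed update for D_t
    G = inv(add(I2, scale(1 / c, D_prev)))             # (I + c^{-1} D_{t-1})^{-1}
    Ax, Bx = matvec(A, x), matvec(B, x)
    s = 1 + x[0] * Bx[0] + x[1] * Bx[1]                # common denominator
    ok39 = mul(D_t_inv, G) == add(A, scale(-1 / s, outer(Bx, Ax)))
    ok40 = mul(G, D_t_inv) == add(A, scale(-1 / s, outer(Ax, Bx)))
    first = mul(mul(G, D_t_inv), mul(outer(x, x), mul(D_t_inv, G)))
    third = mul(G, add(mul(D_t_inv, G), scale(1 / c, I2)))
    M = add(add(first, scale(F(-1), A)), third)
    ok_final = M == scale(-(s - 1) / s ** 2, outer(Ax, Ax))
    return ok39, ok40, ok_final

result = check_lemma7_identities([[2, 1], [1, 3]], [1, -2], 5)
```

Because `Fraction` arithmetic is exact, the three comparisons test the identities themselves rather than a floating-point approximation. The last one confirms that the expression is a non-positive multiple of the rank-one matrix $D_{t-1}^{-1} x_t x_t^\top D_{t-1}^{-1}$, hence negative semidefinite.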
F. Proof of Corollary 12

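Proof: Plugging Lemma 9 in Theorem 8 we have for all $(u_1, \ldots, u_T)$,

$$
\begin{aligned}
L_T(\mathrm{LASER}) &\leq b\|u_1\|^2 + cV^{(2)} + L_T(\{u_t\}) + Y^2 \ln\left|\tfrac{1}{b} D_T\right| + c^{-1}Y^2 \sum_{t=1}^{T} \mathrm{Tr}\left(D_{t-1}\right) \\
&\leq b\|u_1\|^2 + L_T(\{u_t\}) + Y^2 \ln\left|\tfrac{1}{b} D_T\right| + c^{-1}Y^2\, \mathrm{Tr}\left(D_0\right) + cV^{(2)} \\
&\quad + c^{-1} Y^2 T d \max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\; b + X^2 \right\} ,
\end{aligned}
$$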



where the last inequality follows from Lemma 11. The term $c^{-1}Y^2\,\mathrm{Tr}(D_0)$ does not depend on $T$, because

$$
c^{-1}Y^2\, \mathrm{Tr}\left(D_0\right) = c^{-1} Y^2 d\, \frac{bc}{c - b} = \frac{\varepsilon}{1 - \varepsilon}\, Y^2 d .
$$



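To show (32), note that

$$
V^{(2)} \leq T\, \frac{\sqrt{2}\, Y^2 d X}{\mu^{3/2}}
\quad\Longleftrightarrow\quad
\mu \leq \left( \frac{\sqrt{2}\, T Y^2 d X}{V^{(2)}} \right)^{2/3} = c .
$$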



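We thus have that the right term of (30) is upper bounded,

$$
\max\left\{ \frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\; b + X^2 \right\}
\leq \max\left\{ \frac{3X^2 + \sqrt{8X^2 c}}{2},\; b + X^2 \right\}
\leq \max\left\{ \sqrt{8X^2 c},\; b + X^2 \right\} \leq 2X\sqrt{2c} .
$$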

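Using this bound and plugging the value of $c$ from (31) we bound (30),

$$
\begin{aligned}
cV^{(2)} + c^{-1} Y^2 T d\, 2X\sqrt{2c}
&= \left( \frac{\sqrt{2}\, T Y^2 d X}{V^{(2)}} \right)^{2/3} V^{(2)}
+ Y^2 T d\, 2X \sqrt{2} \left( \frac{\sqrt{2}\, T Y^2 d X}{V^{(2)}} \right)^{-1/3} \\
&= 3 \left( \sqrt{2}\, T Y^2 d X \right)^{2/3} \left( V^{(2)} \right)^{1/3} ,
\end{aligned}
$$

which concludes the proof.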
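The last step is an arithmetic identity that is easy to check numerically. The sketch below assumes the choice $c = (\sqrt{2}\, T Y^2 d X / V^{(2)})^{2/3}$ and the drift-dependent part of the bound $cV^{(2)} + c^{-1} Y^2 T d\, 2X\sqrt{2c}$; the function names and the sample values of $T$, $Y$, $d$, $X$, $V^{(2)}$ are hypothetical.

```python
import math

def laser_c(T, Y, d, X, V):
    """Assumed choice of c: (sqrt(2) T Y^2 d X / V)^(2/3)."""
    return (math.sqrt(2) * T * Y ** 2 * d * X / V) ** (2.0 / 3.0)

def drift_terms(c, T, Y, d, X, V):
    """c V + c^{-1} Y^2 T d * 2 X sqrt(2 c), the c-dependent part of the bound."""
    return c * V + Y ** 2 * T * d * 2 * X * math.sqrt(2 * c) / c

def closed_form(T, Y, d, X, V):
    """3 (sqrt(2) T Y^2 d X)^(2/3) V^(1/3)."""
    return 3 * (math.sqrt(2) * T * Y ** 2 * d * X) ** (2.0 / 3.0) * V ** (1.0 / 3.0)

# Hypothetical problem sizes.
T, Y, d, X, V = 1000, 1.0, 5, 1.0, 10.0
c_star = laser_c(T, Y, d, X, V)
at_c_star = drift_terms(c_star, T, Y, d, X, V)
```

Evaluating `drift_terms` at `c_star` reproduces `closed_form`, and nearby values of `c` give a larger sum, which is consistent with this `c` balancing the two terms $cV^{(2)}$ and $c^{-1/2}\, 2\sqrt{2}\, Y^2 T d X$.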



G. Details for the bound 

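To show the bound (33), note that,

$$
V^{(2)} \geq T\, \frac{Y^2 d M}{\mu^2}
\quad\Longleftrightarrow\quad
\mu \geq \sqrt{\frac{T Y^2 d M}{V^{(2)}}} = c .
$$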



We thus have that the right term of (30) is upper bounded as follows,



max ■ 
Using this bound and plugging c we bound (30),